| 10 10 30 30 30 30 30 30 30 30 30 30 29 30 30 29 9 9 9 9 9 9 30 30 30 30 20 20 10 10 10 9 10 10 30 20 20 20 10 10 9 10 10 10 10 10 10 10 10 10 30 10 30 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 | // SPDX-License-Identifier: GPL-2.0 /* * linux/fs/ext4/readpage.c * * Copyright (C) 2002, Linus Torvalds. * Copyright (C) 2015, Google, Inc. * * This was originally taken from fs/mpage.c * * The ext4_mpage_readpages() function here is intended to * replace mpage_readahead() in the general case, not just for * encrypted files. It has some limitations (see below), where it * will fall back to read_block_full_page(), but these limitations * should only be hit when page_size != block_size. * * This will allow us to attach a callback function to support ext4 * encryption. * * If anything unusual happens, such as: * * - encountering a page which has buffers * - encountering a page which has a non-hole after a hole * - encountering a page with non-contiguous blocks * * then this code just gives up and calls the buffer_head-based read function. * It does handle a page which has holes at the end - that is a common case: * the end-of-file on blocksize < PAGE_SIZE setups. * */ #include <linux/kernel.h> #include <linux/export.h> #include <linux/mm.h> #include <linux/kdev_t.h> #include <linux/gfp.h> #include <linux/bio.h> #include <linux/fs.h> #include <linux/buffer_head.h> #include <linux/blkdev.h> #include <linux/highmem.h> #include <linux/prefetch.h> #include <linux/mpage.h> #include <linux/writeback.h> #include <linux/backing-dev.h> #include <linux/pagevec.h> #include "ext4.h" #define NUM_PREALLOC_POST_READ_CTXS 128 static struct kmem_cache *bio_post_read_ctx_cache; static mempool_t *bio_post_read_ctx_pool; /* postprocessing steps for read bios */ enum bio_post_read_step { STEP_INITIAL = 0, STEP_DECRYPT, STEP_VERITY, STEP_MAX, }; struct bio_post_read_ctx { struct bio *bio; struct work_struct work; unsigned int cur_step; unsigned int enabled_steps; }; static void __read_end_io(struct bio *bio) { struct folio_iter fi; bio_for_each_folio_all(fi, bio) folio_end_read(fi.folio, bio->bi_status == 0); if (bio->bi_private) mempool_free(bio->bi_private, bio_post_read_ctx_pool); bio_put(bio); } static void bio_post_read_processing(struct bio_post_read_ctx *ctx); static void decrypt_work(struct work_struct *work) { struct bio_post_read_ctx *ctx = container_of(work, struct bio_post_read_ctx, work); struct bio *bio = ctx->bio; if (fscrypt_decrypt_bio(bio)) bio_post_read_processing(ctx); else __read_end_io(bio); } static void verity_work(struct work_struct *work) { struct bio_post_read_ctx *ctx = container_of(work, struct bio_post_read_ctx, work); struct bio *bio = ctx->bio; /* * fsverity_verify_bio() may call readahead() again, and although verity * will be disabled for that, decryption may still be needed, causing * another bio_post_read_ctx to be allocated. So to guarantee that * mempool_alloc() never deadlocks we must free the current ctx first. * This is safe because verity is the last post-read step. */ BUILD_BUG_ON(STEP_VERITY + 1 != STEP_MAX); mempool_free(ctx, bio_post_read_ctx_pool); bio->bi_private = NULL; fsverity_verify_bio(bio); __read_end_io(bio); } static void bio_post_read_processing(struct bio_post_read_ctx *ctx) { /* * We use different work queues for decryption and for verity because * verity may require reading metadata pages that need decryption, and * we shouldn't recurse to the same workqueue. */ switch (++ctx->cur_step) { case STEP_DECRYPT: if (ctx->enabled_steps & (1 << STEP_DECRYPT)) { INIT_WORK(&ctx->work, decrypt_work); fscrypt_enqueue_decrypt_work(&ctx->work); return; } ctx->cur_step++; fallthrough; case STEP_VERITY: if (ctx->enabled_steps & (1 << STEP_VERITY)) { INIT_WORK(&ctx->work, verity_work); fsverity_enqueue_verify_work(&ctx->work); return; } ctx->cur_step++; fallthrough; default: __read_end_io(ctx->bio); } } static bool bio_post_read_required(struct bio *bio) { return bio->bi_private && !bio->bi_status; } /* * I/O completion handler for multipage BIOs. * * The mpage code never puts partial pages into a BIO (except for end-of-file). * If a page does not map to a contiguous run of blocks then it simply falls * back to block_read_full_folio(). * * Why is this? If a page's completion depends on a number of different BIOs * which can complete in any order (or at the same time) then determining the * status of that page is hard. See end_buffer_async_read() for the details. * There is no point in duplicating all that complexity. */ static void mpage_end_io(struct bio *bio) { if (bio_post_read_required(bio)) { struct bio_post_read_ctx *ctx = bio->bi_private; ctx->cur_step = STEP_INITIAL; bio_post_read_processing(ctx); return; } __read_end_io(bio); } static inline bool ext4_need_verity(const struct inode *inode, pgoff_t idx) { return fsverity_active(inode) && idx < DIV_ROUND_UP(inode->i_size, PAGE_SIZE); } static void ext4_set_bio_post_read_ctx(struct bio *bio, const struct inode *inode, pgoff_t first_idx) { unsigned int post_read_steps = 0; if (fscrypt_inode_uses_fs_layer_crypto(inode)) post_read_steps |= 1 << STEP_DECRYPT; if (ext4_need_verity(inode, first_idx)) post_read_steps |= 1 << STEP_VERITY; if (post_read_steps) { /* Due to the mempool, this never fails. */ struct bio_post_read_ctx *ctx = mempool_alloc(bio_post_read_ctx_pool, GFP_NOFS); ctx->bio = bio; ctx->enabled_steps = post_read_steps; bio->bi_private = ctx; } } static inline loff_t ext4_readpage_limit(struct inode *inode) { if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) return inode->i_sb->s_maxbytes; return i_size_read(inode); } int ext4_mpage_readpages(struct inode *inode, struct readahead_control *rac, struct folio *folio) { struct bio *bio = NULL; sector_t last_block_in_bio = 0; const unsigned blkbits = inode->i_blkbits; const unsigned blocks_per_page = PAGE_SIZE >> blkbits; const unsigned blocksize = 1 << blkbits; sector_t next_block; sector_t block_in_file; sector_t last_block; sector_t last_block_in_file; sector_t first_block; unsigned page_block; struct block_device *bdev = inode->i_sb->s_bdev; int length; unsigned relative_block = 0; struct ext4_map_blocks map; unsigned int nr_pages, folio_pages; map.m_pblk = 0; map.m_lblk = 0; map.m_len = 0; map.m_flags = 0; nr_pages = rac ? readahead_count(rac) : folio_nr_pages(folio); for (; nr_pages; nr_pages -= folio_pages) { int fully_mapped = 1; unsigned int first_hole; unsigned int blocks_per_folio; if (rac) folio = readahead_folio(rac); folio_pages = folio_nr_pages(folio); prefetchw(&folio->flags); if (folio_buffers(folio)) goto confused; blocks_per_folio = folio_size(folio) >> blkbits; first_hole = blocks_per_folio; block_in_file = next_block = (sector_t)folio->index << (PAGE_SHIFT - blkbits); last_block = block_in_file + nr_pages * blocks_per_page; last_block_in_file = (ext4_readpage_limit(inode) + blocksize - 1) >> blkbits; if (last_block > last_block_in_file) last_block = last_block_in_file; page_block = 0; /* * Map blocks using the previous result first. */ if ((map.m_flags & EXT4_MAP_MAPPED) && block_in_file > map.m_lblk && block_in_file < (map.m_lblk + map.m_len)) { unsigned map_offset = block_in_file - map.m_lblk; unsigned last = map.m_len - map_offset; first_block = map.m_pblk + map_offset; for (relative_block = 0; ; relative_block++) { if (relative_block == last) { /* needed? */ map.m_flags &= ~EXT4_MAP_MAPPED; break; } if (page_block == blocks_per_folio) break; page_block++; block_in_file++; } } /* * Then do more ext4_map_blocks() calls until we are * done with this folio. */ while (page_block < blocks_per_folio) { if (block_in_file < last_block) { map.m_lblk = block_in_file; map.m_len = last_block - block_in_file; if (ext4_map_blocks(NULL, inode, &map, 0) < 0) { set_error_page: folio_zero_segment(folio, 0, folio_size(folio)); folio_unlock(folio); goto next_page; } } if ((map.m_flags & EXT4_MAP_MAPPED) == 0) { fully_mapped = 0; if (first_hole == blocks_per_folio) first_hole = page_block; page_block++; block_in_file++; continue; } if (first_hole != blocks_per_folio) goto confused; /* hole -> non-hole */ /* Contiguous blocks? */ if (!page_block) first_block = map.m_pblk; else if (first_block + page_block != map.m_pblk) goto confused; for (relative_block = 0; ; relative_block++) { if (relative_block == map.m_len) { /* needed? */ map.m_flags &= ~EXT4_MAP_MAPPED; break; } else if (page_block == blocks_per_folio) break; page_block++; block_in_file++; } } if (first_hole != blocks_per_folio) { folio_zero_segment(folio, first_hole << blkbits, folio_size(folio)); if (first_hole == 0) { if (ext4_need_verity(inode, folio->index) && !fsverity_verify_folio(folio)) goto set_error_page; folio_end_read(folio, true); continue; } } else if (fully_mapped) { folio_set_mappedtodisk(folio); } /* * This folio will go to BIO. Do we need to send this * BIO off first? */ if (bio && (last_block_in_bio != first_block - 1 || !fscrypt_mergeable_bio(bio, inode, next_block))) { submit_and_realloc: submit_bio(bio); bio = NULL; } if (bio == NULL) { /* * bio_alloc will _always_ be able to allocate a bio if * __GFP_DIRECT_RECLAIM is set, see bio_alloc_bioset(). */ bio = bio_alloc(bdev, bio_max_segs(nr_pages), REQ_OP_READ, GFP_KERNEL); fscrypt_set_bio_crypt_ctx(bio, inode, next_block, GFP_KERNEL); ext4_set_bio_post_read_ctx(bio, inode, folio->index); bio->bi_iter.bi_sector = first_block << (blkbits - 9); bio->bi_end_io = mpage_end_io; if (rac) bio->bi_opf |= REQ_RAHEAD; } length = first_hole << blkbits; if (!bio_add_folio(bio, folio, length, 0)) goto submit_and_realloc; if (((map.m_flags & EXT4_MAP_BOUNDARY) && (relative_block == map.m_len)) || (first_hole != blocks_per_folio)) { submit_bio(bio); bio = NULL; } else last_block_in_bio = first_block + blocks_per_folio - 1; continue; confused: if (bio) { submit_bio(bio); bio = NULL; } if (!folio_test_uptodate(folio)) block_read_full_folio(folio, ext4_get_block); else folio_unlock(folio); next_page: ; /* A label shall be followed by a statement until C23 */ } if (bio) submit_bio(bio); return 0; } int __init ext4_init_post_read_processing(void) { bio_post_read_ctx_cache = KMEM_CACHE(bio_post_read_ctx, SLAB_RECLAIM_ACCOUNT); if (!bio_post_read_ctx_cache) goto fail; bio_post_read_ctx_pool = mempool_create_slab_pool(NUM_PREALLOC_POST_READ_CTXS, bio_post_read_ctx_cache); if (!bio_post_read_ctx_pool) goto fail_free_cache; return 0; fail_free_cache: kmem_cache_destroy(bio_post_read_ctx_cache); fail: return -ENOMEM; } void ext4_exit_post_read_processing(void) { mempool_destroy(bio_post_read_ctx_pool); kmem_cache_destroy(bio_post_read_ctx_cache); } |
| 222 222 223 222 223 6 6 6 223 222 4 223 10 10 10 10 684 283 195 194 283 256 255 198 255 255 256 221 221 221 184 184 183 184 221 114 79 79 79 79 79 114 135 133 2 2 135 6 5 6 6 6 6 6 185 184 185 184 185 223 185 47 46 2 46 47 67 43 43 75 280 223 221 221 1 8 8 7 224 224 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 | // SPDX-License-Identifier: GPL-2.0-or-later /* SCTP kernel implementation * (C) Copyright IBM Corp. 2001, 2003 * Copyright (c) Cisco 1999,2000 * Copyright (c) Motorola 1999,2000,2001 * Copyright (c) La Monte H.P. Yarroll 2001 * * This file is part of the SCTP kernel implementation. * * A collection class to handle the storage of transport addresses. * * Please send any bug reports or fixes you make to the * email address(es): * lksctp developers <linux-sctp@vger.kernel.org> * * Written or modified by: * La Monte H.P. Yarroll <piggy@acm.org> * Karl Knutson <karl@athena.chicago.il.us> * Jon Grimm <jgrimm@us.ibm.com> * Daisy Chang <daisyc@us.ibm.com> */ #include <linux/types.h> #include <linux/slab.h> #include <linux/in.h> #include <net/sock.h> #include <net/ipv6.h> #include <net/if_inet6.h> #include <net/sctp/sctp.h> #include <net/sctp/sm.h> /* Forward declarations for internal helpers. */ static int sctp_copy_one_addr(struct net *net, struct sctp_bind_addr *dest, union sctp_addr *addr, enum sctp_scope scope, gfp_t gfp, int flags); static void sctp_bind_addr_clean(struct sctp_bind_addr *); /* First Level Abstractions. */ /* Copy 'src' to 'dest' taking 'scope' into account. Omit addresses * in 'src' which have a broader scope than 'scope'. */ int sctp_bind_addr_copy(struct net *net, struct sctp_bind_addr *dest, const struct sctp_bind_addr *src, enum sctp_scope scope, gfp_t gfp, int flags) { struct sctp_sockaddr_entry *addr; int error = 0; /* All addresses share the same port. */ dest->port = src->port; /* Extract the addresses which are relevant for this scope. */ list_for_each_entry(addr, &src->address_list, list) { error = sctp_copy_one_addr(net, dest, &addr->a, scope, gfp, flags); if (error < 0) goto out; } /* If there are no addresses matching the scope and * this is global scope, try to get a link scope address, with * the assumption that we must be sitting behind a NAT. */ if (list_empty(&dest->address_list) && (SCTP_SCOPE_GLOBAL == scope)) { list_for_each_entry(addr, &src->address_list, list) { error = sctp_copy_one_addr(net, dest, &addr->a, SCTP_SCOPE_LINK, gfp, flags); if (error < 0) goto out; } } /* If somehow no addresses were found that can be used with this * scope, it's an error. */ if (list_empty(&dest->address_list)) error = -ENETUNREACH; out: if (error) sctp_bind_addr_clean(dest); return error; } /* Exactly duplicate the address lists. This is necessary when doing * peer-offs and accepts. We don't want to put all the current system * addresses into the endpoint. That's useless. But we do want duplicat * the list of bound addresses that the older endpoint used. */ int sctp_bind_addr_dup(struct sctp_bind_addr *dest, const struct sctp_bind_addr *src, gfp_t gfp) { struct sctp_sockaddr_entry *addr; int error = 0; /* All addresses share the same port. */ dest->port = src->port; list_for_each_entry(addr, &src->address_list, list) { error = sctp_add_bind_addr(dest, &addr->a, sizeof(addr->a), 1, gfp); if (error < 0) break; } return error; } /* Initialize the SCTP_bind_addr structure for either an endpoint or * an association. */ void sctp_bind_addr_init(struct sctp_bind_addr *bp, __u16 port) { INIT_LIST_HEAD(&bp->address_list); bp->port = port; } /* Dispose of the address list. */ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp) { struct sctp_sockaddr_entry *addr, *temp; /* Empty the bind address list. */ list_for_each_entry_safe(addr, temp, &bp->address_list, list) { list_del_rcu(&addr->list); kfree_rcu(addr, rcu); SCTP_DBG_OBJCNT_DEC(addr); } } /* Dispose of an SCTP_bind_addr structure */ void sctp_bind_addr_free(struct sctp_bind_addr *bp) { /* Empty the bind address list. */ sctp_bind_addr_clean(bp); } /* Add an address to the bind address list in the SCTP_bind_addr structure. */ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union sctp_addr *new, int new_size, __u8 addr_state, gfp_t gfp) { struct sctp_sockaddr_entry *addr; /* Add the address to the bind address list. */ addr = kzalloc(sizeof(*addr), gfp); if (!addr) return -ENOMEM; memcpy(&addr->a, new, min_t(size_t, sizeof(*new), new_size)); /* Fix up the port if it has not yet been set. * Both v4 and v6 have the port at the same offset. */ if (!addr->a.v4.sin_port) addr->a.v4.sin_port = htons(bp->port); addr->state = addr_state; addr->valid = 1; INIT_LIST_HEAD(&addr->list); /* We always hold a socket lock when calling this function, * and that acts as a writer synchronizing lock. */ list_add_tail_rcu(&addr->list, &bp->address_list); SCTP_DBG_OBJCNT_INC(addr); return 0; } /* Delete an address from the bind address list in the SCTP_bind_addr * structure. */ int sctp_del_bind_addr(struct sctp_bind_addr *bp, union sctp_addr *del_addr) { struct sctp_sockaddr_entry *addr, *temp; int found = 0; /* We hold the socket lock when calling this function, * and that acts as a writer synchronizing lock. */ list_for_each_entry_safe(addr, temp, &bp->address_list, list) { if (sctp_cmp_addr_exact(&addr->a, del_addr)) { /* Found the exact match. */ found = 1; addr->valid = 0; list_del_rcu(&addr->list); break; } } if (found) { kfree_rcu(addr, rcu); SCTP_DBG_OBJCNT_DEC(addr); return 0; } return -EINVAL; } /* Create a network byte-order representation of all the addresses * formated as SCTP parameters. * * The second argument is the return value for the length. */ union sctp_params sctp_bind_addrs_to_raw(const struct sctp_bind_addr *bp, int *addrs_len, gfp_t gfp) { union sctp_params addrparms; union sctp_params retval; int addrparms_len; union sctp_addr_param rawaddr; int len; struct sctp_sockaddr_entry *addr; struct list_head *pos; struct sctp_af *af; addrparms_len = 0; len = 0; /* Allocate enough memory at once. */ list_for_each(pos, &bp->address_list) { len += sizeof(union sctp_addr_param); } /* Don't even bother embedding an address if there * is only one. */ if (len == sizeof(union sctp_addr_param)) { retval.v = NULL; goto end_raw; } retval.v = kmalloc(len, gfp); if (!retval.v) goto end_raw; addrparms = retval; list_for_each_entry(addr, &bp->address_list, list) { af = sctp_get_af_specific(addr->a.v4.sin_family); len = af->to_addr_param(&addr->a, &rawaddr); memcpy(addrparms.v, &rawaddr, len); addrparms.v += len; addrparms_len += len; } end_raw: *addrs_len = addrparms_len; return retval; } /* * Create an address list out of the raw address list format (IPv4 and IPv6 * address parameters). */ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, __u8 *raw_addr_list, int addrs_len, __u16 port, gfp_t gfp) { union sctp_addr_param *rawaddr; struct sctp_paramhdr *param; union sctp_addr addr; int retval = 0; int len; struct sctp_af *af; /* Convert the raw address to standard address format */ while (addrs_len) { param = (struct sctp_paramhdr *)raw_addr_list; rawaddr = (union sctp_addr_param *)raw_addr_list; af = sctp_get_af_specific(param_type2af(param->type)); if (unlikely(!af) || !af->from_addr_param(&addr, rawaddr, htons(port), 0)) { retval = -EINVAL; goto out_err; } if (sctp_bind_addr_state(bp, &addr) != -1) goto next; retval = sctp_add_bind_addr(bp, &addr, sizeof(addr), SCTP_ADDR_SRC, gfp); if (retval) /* Can't finish building the list, clean up. */ goto out_err; next: len = ntohs(param->length); addrs_len -= len; raw_addr_list += len; } return retval; out_err: if (retval) sctp_bind_addr_clean(bp); return retval; } /******************************************************************** * 2nd Level Abstractions ********************************************************************/ /* Does this contain a specified address? Allow wildcarding. */ int sctp_bind_addr_match(struct sctp_bind_addr *bp, const union sctp_addr *addr, struct sctp_sock *opt) { struct sctp_sockaddr_entry *laddr; int match = 0; rcu_read_lock(); list_for_each_entry_rcu(laddr, &bp->address_list, list) { if (!laddr->valid) continue; if (opt->pf->cmp_addr(&laddr->a, addr, opt)) { match = 1; break; } } rcu_read_unlock(); return match; } int sctp_bind_addrs_check(struct sctp_sock *sp, struct sctp_sock *sp2, int cnt2) { struct sctp_bind_addr *bp2 = &sp2->ep->base.bind_addr; struct sctp_bind_addr *bp = &sp->ep->base.bind_addr; struct sctp_sockaddr_entry *laddr, *laddr2; bool exist = false; int cnt = 0; rcu_read_lock(); list_for_each_entry_rcu(laddr, &bp->address_list, list) { list_for_each_entry_rcu(laddr2, &bp2->address_list, list) { if (sp->pf->af->cmp_addr(&laddr->a, &laddr2->a) && laddr->valid && laddr2->valid) { exist = true; goto next; } } cnt = 0; break; next: cnt++; } rcu_read_unlock(); return (cnt == cnt2) ? 0 : (exist ? -EEXIST : 1); } /* Does the address 'addr' conflict with any addresses in * the bp. */ int sctp_bind_addr_conflict(struct sctp_bind_addr *bp, const union sctp_addr *addr, struct sctp_sock *bp_sp, struct sctp_sock *addr_sp) { struct sctp_sockaddr_entry *laddr; int conflict = 0; struct sctp_sock *sp; /* Pick the IPv6 socket as the basis of comparison * since it's usually a superset of the IPv4. * If there is no IPv6 socket, then default to bind_addr. */ if (sctp_opt2sk(bp_sp)->sk_family == AF_INET6) sp = bp_sp; else if (sctp_opt2sk(addr_sp)->sk_family == AF_INET6) sp = addr_sp; else sp = bp_sp; rcu_read_lock(); list_for_each_entry_rcu(laddr, &bp->address_list, list) { if (!laddr->valid) continue; conflict = sp->pf->cmp_addr(&laddr->a, addr, sp); if (conflict) break; } rcu_read_unlock(); return conflict; } /* Get the state of the entry in the bind_addr_list */ int sctp_bind_addr_state(const struct sctp_bind_addr *bp, const union sctp_addr *addr) { struct sctp_sockaddr_entry *laddr; struct sctp_af *af; af = sctp_get_af_specific(addr->sa.sa_family); if (unlikely(!af)) return -1; list_for_each_entry_rcu(laddr, &bp->address_list, list) { if (!laddr->valid) continue; if (af->cmp_addr(&laddr->a, addr)) return laddr->state; } return -1; } /* Find the first address in the bind address list that is not present in * the addrs packed array. */ union sctp_addr *sctp_find_unmatch_addr(struct sctp_bind_addr *bp, const union sctp_addr *addrs, int addrcnt, struct sctp_sock *opt) { struct sctp_sockaddr_entry *laddr; union sctp_addr *addr; void *addr_buf; struct sctp_af *af; int i; /* This is only called sctp_send_asconf_del_ip() and we hold * the socket lock in that code patch, so that address list * can't change. */ list_for_each_entry(laddr, &bp->address_list, list) { addr_buf = (union sctp_addr *)addrs; for (i = 0; i < addrcnt; i++) { addr = addr_buf; af = sctp_get_af_specific(addr->v4.sin_family); if (!af) break; if (opt->pf->cmp_addr(&laddr->a, addr, opt)) break; addr_buf += af->sockaddr_len; } if (i == addrcnt) return &laddr->a; } return NULL; } /* Copy out addresses from the global local address list. */ static int sctp_copy_one_addr(struct net *net, struct sctp_bind_addr *dest, union sctp_addr *addr, enum sctp_scope scope, gfp_t gfp, int flags) { int error = 0; if (sctp_is_any(NULL, addr)) { error = sctp_copy_local_addr_list(net, dest, scope, gfp, flags); } else if (sctp_in_scope(net, addr, scope)) { /* Now that the address is in scope, check to see if * the address type is supported by local sock as * well as the remote peer. */ if ((((AF_INET == addr->sa.sa_family) && (flags & SCTP_ADDR4_ALLOWED) && (flags & SCTP_ADDR4_PEERSUPP))) || (((AF_INET6 == addr->sa.sa_family) && (flags & SCTP_ADDR6_ALLOWED) && (flags & SCTP_ADDR6_PEERSUPP)))) error = sctp_add_bind_addr(dest, addr, sizeof(*addr), SCTP_ADDR_SRC, gfp); } return error; } /* Is this a wildcard address? */ int sctp_is_any(struct sock *sk, const union sctp_addr *addr) { unsigned short fam = 0; struct sctp_af *af; /* Try to get the right address family */ if (addr->sa.sa_family != AF_UNSPEC) fam = addr->sa.sa_family; else if (sk) fam = sk->sk_family; af = sctp_get_af_specific(fam); if (!af) return 0; return af->is_any(addr); } /* Is 'addr' valid for 'scope'? */ int sctp_in_scope(struct net *net, const union sctp_addr *addr, enum sctp_scope scope) { enum sctp_scope addr_scope = sctp_scope(addr); /* The unusable SCTP addresses will not be considered with * any defined scopes. */ if (SCTP_SCOPE_UNUSABLE == addr_scope) return 0; /* * For INIT and INIT-ACK address list, let L be the level of * requested destination address, sender and receiver * SHOULD include all of its addresses with level greater * than or equal to L. * * Address scoping can be selectively controlled via sysctl * option */ switch (net->sctp.scope_policy) { case SCTP_SCOPE_POLICY_DISABLE: return 1; case SCTP_SCOPE_POLICY_ENABLE: if (addr_scope <= scope) return 1; break; case SCTP_SCOPE_POLICY_PRIVATE: if (addr_scope <= scope || SCTP_SCOPE_PRIVATE == addr_scope) return 1; break; case SCTP_SCOPE_POLICY_LINK: if (addr_scope <= scope || SCTP_SCOPE_LINK == addr_scope) return 1; break; default: break; } return 0; } int sctp_is_ep_boundall(struct sock *sk) { struct sctp_bind_addr *bp; struct sctp_sockaddr_entry *addr; bp = &sctp_sk(sk)->ep->base.bind_addr; if (sctp_list_single_entry(&bp->address_list)) { addr = list_entry(bp->address_list.next, struct sctp_sockaddr_entry, list); if (sctp_is_any(sk, &addr->a)) return 1; } return 0; } /******************************************************************** * 3rd Level Abstractions ********************************************************************/ /* What is the scope of 'addr'? */ enum sctp_scope sctp_scope(const union sctp_addr *addr) { struct sctp_af *af; af = sctp_get_af_specific(addr->sa.sa_family); if (!af) return SCTP_SCOPE_UNUSABLE; return af->scope((union sctp_addr *)addr); } |
| 12 12 12 12 11 3 1202 10 1202 11 2 9 7 2 115 26 12 11 91 102 1 102 102 102 114 649 24 24 24 1 23 23 19 18 23 18 23 22 24 24 25 25 19 10 10 10 10 10 25 20 103 33 25 29 25 20 20 20 20 20 224 224 3 47 356 1 356 222 356 160 160 160 160 160 160 159 159 21 21 19 19 142 4 4 142 1 142 137 137 1 136 5 7 137 292 291 291 291 258 202 291 285 285 276 276 41 41 35 35 34 34 34 34 34 34 271 270 268 268 5 5 268 17 268 262 262 2 260 6 291 291 291 283 222 231 9 9 9 9 9 9 349 349 349 349 3 219 349 349 34 34 23 34 32 32 32 32 9 6 26 6 3 9 32 28 28 6 6 6 5 5 5 5 5 5 26 6 6 6 3 6 26 26 26 26 32 6 32 32 32 34 34 34 34 34 28 34 34 34 33 34 14 49 47 47 5 50 5 4 50 49 48 48 14 34 91 91 12 12 22 22 17 82 83 82 82 81 81 81 81 81 79 25 10 6 18 18 8 3 8 3 18 19 19 18 99 18 18 9 2 4 4 4 3 2 1 3 6 5 13 13 5 3 2 753 13 754 83 33 9 348 50 232 100 6 2 8 3 27 2 2 2 1 4 3 1 4 4 3 7 49 12 115 115 1 797 3 49 50 35 20 35 23 23 755 50 35 731 733 25 573 734 228 733 548 185 400 403 148 289 287 810 121 680 807 809 592 121 460 593 1 50 797 50 13 797 790 797 795 798 9 794 792 795 91 742 578 787 785 784 785 785 566 574 50 787 13 12 13 13 9 9 9 9 9 13 13 13 24 23 22 20 3 17 1 16 14 12 16 17 20 24 4 16 1 13 13 13 1 12 12 14 18 14 1 13 4 1 18 12 16 12 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 | // SPDX-License-Identifier: GPL-2.0 /* * linux/mm/madvise.c * * Copyright (C) 1999 Linus Torvalds * Copyright (C) 2002 Christoph Hellwig */ #include <linux/mman.h> #include <linux/pagemap.h> #include <linux/syscalls.h> #include <linux/mempolicy.h> #include <linux/page-isolation.h> #include <linux/page_idle.h> #include <linux/userfaultfd_k.h> #include <linux/hugetlb.h> #include <linux/falloc.h> #include <linux/fadvise.h> #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/mm_inline.h> #include <linux/mmu_context.h> #include <linux/string.h> #include <linux/uio.h> #include <linux/ksm.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/blkdev.h> #include <linux/backing-dev.h> #include <linux/pagewalk.h> #include <linux/swap.h> #include <linux/swapops.h> #include <linux/shmem_fs.h> #include <linux/mmu_notifier.h> #include <asm/tlb.h> #include "internal.h" #include "swap.h" #define __MADV_SET_ANON_VMA_NAME (-1) /* * Maximum number of attempts we make to install guard pages before we give up * and return -ERESTARTNOINTR to have userspace try again. */ #define MAX_MADVISE_GUARD_RETRIES 3 struct madvise_walk_private { struct mmu_gather *tlb; bool pageout; }; enum madvise_lock_mode { MADVISE_NO_LOCK, MADVISE_MMAP_READ_LOCK, MADVISE_MMAP_WRITE_LOCK, MADVISE_VMA_READ_LOCK, }; struct madvise_behavior_range { unsigned long start; unsigned long end; }; struct madvise_behavior { struct mm_struct *mm; int behavior; struct mmu_gather *tlb; enum madvise_lock_mode lock_mode; struct anon_vma_name *anon_name; /* * The range over which the behaviour is currently being applied. If * traversing multiple VMAs, this is updated for each. */ struct madvise_behavior_range range; /* The VMA and VMA preceding it (if applicable) currently targeted. */ struct vm_area_struct *prev; struct vm_area_struct *vma; bool lock_dropped; }; #ifdef CONFIG_ANON_VMA_NAME static int madvise_walk_vmas(struct madvise_behavior *madv_behavior); struct anon_vma_name *anon_vma_name_alloc(const char *name) { struct anon_vma_name *anon_name; size_t count; /* Add 1 for NUL terminator at the end of the anon_name->name */ count = strlen(name) + 1; anon_name = kmalloc(struct_size(anon_name, name, count), GFP_KERNEL); if (anon_name) { kref_init(&anon_name->kref); memcpy(anon_name->name, name, count); } return anon_name; } void anon_vma_name_free(struct kref *kref) { struct anon_vma_name *anon_name = container_of(kref, struct anon_vma_name, kref); kfree(anon_name); } struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma) { if (!rwsem_is_locked(&vma->vm_mm->mmap_lock)) vma_assert_locked(vma); return vma->anon_name; } /* mmap_lock should be write-locked */ static int replace_anon_vma_name(struct vm_area_struct *vma, struct anon_vma_name *anon_name) { struct anon_vma_name *orig_name = anon_vma_name(vma); if (!anon_name) { vma->anon_name = NULL; anon_vma_name_put(orig_name); return 0; } if (anon_vma_name_eq(orig_name, anon_name)) return 0; vma->anon_name = anon_vma_name_reuse(anon_name); anon_vma_name_put(orig_name); return 0; } #else /* CONFIG_ANON_VMA_NAME */ static int replace_anon_vma_name(struct vm_area_struct *vma, struct anon_vma_name *anon_name) { if (anon_name) return -EINVAL; return 0; } #endif /* CONFIG_ANON_VMA_NAME */ /* * Update the vm_flags or anon_name on region of a vma, splitting it or merging * it as necessary. Must be called with mmap_lock held for writing. */ static int madvise_update_vma(vm_flags_t new_flags, struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; struct madvise_behavior_range *range = &madv_behavior->range; struct anon_vma_name *anon_name = madv_behavior->anon_name; bool set_new_anon_name = madv_behavior->behavior == __MADV_SET_ANON_VMA_NAME; VMA_ITERATOR(vmi, madv_behavior->mm, range->start); if (new_flags == vma->vm_flags && (!set_new_anon_name || anon_vma_name_eq(anon_vma_name(vma), anon_name))) return 0; if (set_new_anon_name) vma = vma_modify_name(&vmi, madv_behavior->prev, vma, range->start, range->end, anon_name); else vma = vma_modify_flags(&vmi, madv_behavior->prev, vma, range->start, range->end, new_flags); if (IS_ERR(vma)) return PTR_ERR(vma); madv_behavior->vma = vma; /* vm_flags is protected by the mmap_lock held in write mode. */ vma_start_write(vma); vm_flags_reset(vma, new_flags); if (set_new_anon_name) return replace_anon_vma_name(vma, anon_name); return 0; } #ifdef CONFIG_SWAP static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, unsigned long end, struct mm_walk *walk) { struct vm_area_struct *vma = walk->private; struct swap_iocb *splug = NULL; pte_t *ptep = NULL; spinlock_t *ptl; unsigned long addr; for (addr = start; addr < end; addr += PAGE_SIZE) { pte_t pte; swp_entry_t entry; struct folio *folio; if (!ptep++) { ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); if (!ptep) break; } pte = ptep_get(ptep); if (!is_swap_pte(pte)) continue; entry = pte_to_swp_entry(pte); if (unlikely(non_swap_entry(entry))) continue; pte_unmap_unlock(ptep, ptl); ptep = NULL; folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma, addr, &splug); if (folio) folio_put(folio); } if (ptep) pte_unmap_unlock(ptep, ptl); swap_read_unplug(splug); cond_resched(); return 0; } static const struct mm_walk_ops swapin_walk_ops = { .pmd_entry = swapin_walk_pmd_entry, .walk_lock = PGWALK_RDLOCK, }; static void shmem_swapin_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, struct address_space *mapping) { XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start)); pgoff_t end_index = linear_page_index(vma, end) - 1; struct folio *folio; struct swap_iocb *splug = NULL; rcu_read_lock(); xas_for_each(&xas, folio, end_index) { unsigned long addr; swp_entry_t entry; if (!xa_is_value(folio)) continue; entry = radix_to_swp_entry(folio); /* There might be swapin error entries in shmem mapping. */ if (non_swap_entry(entry)) continue; addr = vma->vm_start + ((xas.xa_index - vma->vm_pgoff) << PAGE_SHIFT); xas_pause(&xas); rcu_read_unlock(); folio = read_swap_cache_async(entry, mapping_gfp_mask(mapping), vma, addr, &splug); if (folio) folio_put(folio); rcu_read_lock(); } rcu_read_unlock(); swap_read_unplug(splug); } #endif /* CONFIG_SWAP */ static void mark_mmap_lock_dropped(struct madvise_behavior *madv_behavior) { VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK); madv_behavior->lock_dropped = true; } /* * Schedule all required I/O operations. Do not wait for completion. */ static long madvise_willneed(struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; struct mm_struct *mm = madv_behavior->mm; struct file *file = vma->vm_file; unsigned long start = madv_behavior->range.start; unsigned long end = madv_behavior->range.end; loff_t offset; #ifdef CONFIG_SWAP if (!file) { walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma); lru_add_drain(); /* Push any new pages onto the LRU now */ return 0; } if (shmem_mapping(file->f_mapping)) { shmem_swapin_range(vma, start, end, file->f_mapping); lru_add_drain(); /* Push any new pages onto the LRU now */ return 0; } #else if (!file) return -EBADF; #endif if (IS_DAX(file_inode(file))) { /* no bad return value, but ignore advice */ return 0; } /* * Filesystem's fadvise may need to take various locks. We need to * explicitly grab a reference because the vma (and hence the * vma's reference to the file) can go away as soon as we drop * mmap_lock. */ mark_mmap_lock_dropped(madv_behavior); get_file(file); offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); mmap_read_unlock(mm); vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED); fput(file); mmap_read_lock(mm); return 0; } static inline bool can_do_file_pageout(struct vm_area_struct *vma) { if (!vma->vm_file) return false; /* * paging out pagecache only for non-anonymous mappings that correspond * to the files the calling process could (if tried) open for writing; * otherwise we'd be including shared non-exclusive mappings, which * opens a side channel. */ return inode_owner_or_capable(&nop_mnt_idmap, file_inode(vma->vm_file)) || file_permission(vma->vm_file, MAY_WRITE) == 0; } static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end, struct folio *folio, pte_t *ptep, pte_t *ptentp) { int max_nr = (end - addr) / PAGE_SIZE; return folio_pte_batch_flags(folio, NULL, ptep, ptentp, max_nr, FPB_MERGE_YOUNG_DIRTY); } static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct madvise_walk_private *private = walk->private; struct mmu_gather *tlb = private->tlb; bool pageout = private->pageout; struct mm_struct *mm = tlb->mm; struct vm_area_struct *vma = walk->vma; pte_t *start_pte, *pte, ptent; spinlock_t *ptl; struct folio *folio = NULL; LIST_HEAD(folio_list); bool pageout_anon_only_filter; unsigned int batch_count = 0; int nr; if (fatal_signal_pending(current)) return -EINTR; pageout_anon_only_filter = pageout && !vma_is_anonymous(vma) && !can_do_file_pageout(vma); #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (pmd_trans_huge(*pmd)) { pmd_t orig_pmd; unsigned long next = pmd_addr_end(addr, end); tlb_change_page_size(tlb, HPAGE_PMD_SIZE); ptl = pmd_trans_huge_lock(pmd, vma); if (!ptl) return 0; orig_pmd = *pmd; if (is_huge_zero_pmd(orig_pmd)) goto huge_unlock; if (unlikely(!pmd_present(orig_pmd))) { VM_BUG_ON(thp_migration_supported() && !is_pmd_migration_entry(orig_pmd)); goto huge_unlock; } folio = pmd_folio(orig_pmd); /* Do not interfere with other mappings of this folio */ if (folio_maybe_mapped_shared(folio)) goto huge_unlock; if (pageout_anon_only_filter && !folio_test_anon(folio)) goto huge_unlock; if (next - addr != HPAGE_PMD_SIZE) { int err; folio_get(folio); spin_unlock(ptl); folio_lock(folio); err = split_folio(folio); folio_unlock(folio); folio_put(folio); if (!err) goto regular_folio; return 0; } if (!pageout && pmd_young(orig_pmd)) { pmdp_invalidate(vma, addr, pmd); orig_pmd = pmd_mkold(orig_pmd); set_pmd_at(mm, addr, pmd, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); } folio_clear_referenced(folio); folio_test_clear_young(folio); if (folio_test_active(folio)) folio_set_workingset(folio); if (pageout) { if (folio_isolate_lru(folio)) { if (folio_test_unevictable(folio)) folio_putback_lru(folio); else list_add(&folio->lru, &folio_list); } } else folio_deactivate(folio); huge_unlock: spin_unlock(ptl); if (pageout) reclaim_pages(&folio_list); return 0; } regular_folio: #endif tlb_change_page_size(tlb, PAGE_SIZE); restart: start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); if (!start_pte) return 0; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) { nr = 1; ptent = ptep_get(pte); if (++batch_count == SWAP_CLUSTER_MAX) { batch_count = 0; if (need_resched()) { arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); cond_resched(); goto restart; } } if (pte_none(ptent)) continue; if (!pte_present(ptent)) continue; folio = vm_normal_folio(vma, addr, ptent); if (!folio || folio_is_zone_device(folio)) continue; /* * If we encounter a large folio, only split it if it is not * fully mapped within the range we are operating on. Otherwise * leave it as is so that it can be swapped out whole. If we * fail to split a folio, leave it in place and advance to the * next pte in the range. */ if (folio_test_large(folio)) { nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent); if (nr < folio_nr_pages(folio)) { int err; if (folio_maybe_mapped_shared(folio)) continue; if (pageout_anon_only_filter && !folio_test_anon(folio)) continue; if (!folio_trylock(folio)) continue; folio_get(folio); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); start_pte = NULL; err = split_folio(folio); folio_unlock(folio); folio_put(folio); start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!start_pte) break; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); if (!err) nr = 0; continue; } } /* * Do not interfere with other mappings of this folio and * non-LRU folio. If we have a large folio at this point, we * know it is fully mapped so if its mapcount is the same as its * number of pages, it must be exclusive. */ if (!folio_test_lru(folio) || folio_mapcount(folio) != folio_nr_pages(folio)) continue; if (pageout_anon_only_filter && !folio_test_anon(folio)) continue; if (!pageout && pte_young(ptent)) { clear_young_dirty_ptes(vma, addr, pte, nr, CYDP_CLEAR_YOUNG); tlb_remove_tlb_entries(tlb, pte, nr, addr); } /* * We are deactivating a folio for accelerating reclaiming. * VM couldn't reclaim the folio unless we clear PG_young. * As a side effect, it makes confuse idle-page tracking * because they will miss recent referenced history. */ folio_clear_referenced(folio); folio_test_clear_young(folio); if (folio_test_active(folio)) folio_set_workingset(folio); if (pageout) { if (folio_isolate_lru(folio)) { if (folio_test_unevictable(folio)) folio_putback_lru(folio); else list_add(&folio->lru, &folio_list); } } else folio_deactivate(folio); } if (start_pte) { arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); } if (pageout) reclaim_pages(&folio_list); cond_resched(); return 0; } static const struct mm_walk_ops cold_walk_ops = { .pmd_entry = madvise_cold_or_pageout_pte_range, .walk_lock = PGWALK_RDLOCK, }; static void madvise_cold_page_range(struct mmu_gather *tlb, struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; struct madvise_behavior_range *range = &madv_behavior->range; struct madvise_walk_private walk_private = { .pageout = false, .tlb = tlb, }; tlb_start_vma(tlb, vma); walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops, &walk_private); tlb_end_vma(tlb, vma); } static inline bool can_madv_lru_vma(struct vm_area_struct *vma) { return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP|VM_HUGETLB)); } static long madvise_cold(struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; struct mmu_gather tlb; if (!can_madv_lru_vma(vma)) return -EINVAL; lru_add_drain(); tlb_gather_mmu(&tlb, madv_behavior->mm); madvise_cold_page_range(&tlb, madv_behavior); tlb_finish_mmu(&tlb); return 0; } static void madvise_pageout_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, struct madvise_behavior_range *range) { struct madvise_walk_private walk_private = { .pageout = true, .tlb = tlb, }; tlb_start_vma(tlb, vma); walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops, &walk_private); tlb_end_vma(tlb, vma); } static long madvise_pageout(struct madvise_behavior *madv_behavior) { struct mmu_gather tlb; struct vm_area_struct *vma = madv_behavior->vma; if (!can_madv_lru_vma(vma)) return -EINVAL; /* * If the VMA belongs to a private file mapping, there can be private * dirty pages which can be paged out if even this process is neither * owner nor write capable of the file. We allow private file mappings * further to pageout dirty anon pages. */ if (!vma_is_anonymous(vma) && (!can_do_file_pageout(vma) && (vma->vm_flags & VM_MAYSHARE))) return 0; lru_add_drain(); tlb_gather_mmu(&tlb, madv_behavior->mm); madvise_pageout_page_range(&tlb, vma, &madv_behavior->range); tlb_finish_mmu(&tlb); return 0; } static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { const cydp_t cydp_flags = CYDP_CLEAR_YOUNG | CYDP_CLEAR_DIRTY; struct mmu_gather *tlb = walk->private; struct mm_struct *mm = tlb->mm; struct vm_area_struct *vma = walk->vma; spinlock_t *ptl; pte_t *start_pte, *pte, ptent; struct folio *folio; int nr_swap = 0; unsigned long next; int nr, max_nr; next = pmd_addr_end(addr, end); if (pmd_trans_huge(*pmd)) if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next)) return 0; tlb_change_page_size(tlb, PAGE_SIZE); start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!start_pte) return 0; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { nr = 1; ptent = ptep_get(pte); if (pte_none(ptent)) continue; /* * If the pte has swp_entry, just clear page table to * prevent swap-in which is more expensive rather than * (page allocation + zeroing). */ if (!pte_present(ptent)) { swp_entry_t entry; entry = pte_to_swp_entry(ptent); if (!non_swap_entry(entry)) { max_nr = (end - addr) / PAGE_SIZE; nr = swap_pte_batch(pte, max_nr, ptent); nr_swap -= nr; free_swap_and_cache_nr(entry, nr); clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); } continue; } folio = vm_normal_folio(vma, addr, ptent); if (!folio || folio_is_zone_device(folio)) continue; /* * If we encounter a large folio, only split it if it is not * fully mapped within the range we are operating on. Otherwise * leave it as is so that it can be marked as lazyfree. If we * fail to split a folio, leave it in place and advance to the * next pte in the range. */ if (folio_test_large(folio)) { nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent); if (nr < folio_nr_pages(folio)) { int err; if (folio_maybe_mapped_shared(folio)) continue; if (!folio_trylock(folio)) continue; folio_get(folio); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); start_pte = NULL; err = split_folio(folio); folio_unlock(folio); folio_put(folio); pte = pte_offset_map_lock(mm, pmd, addr, &ptl); start_pte = pte; if (!start_pte) break; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); if (!err) nr = 0; continue; } } if (folio_test_swapcache(folio) || folio_test_dirty(folio)) { if (!folio_trylock(folio)) continue; /* * If we have a large folio at this point, we know it is * fully mapped so if its mapcount is the same as its * number of pages, it must be exclusive. */ if (folio_mapcount(folio) != folio_nr_pages(folio)) { folio_unlock(folio); continue; } if (folio_test_swapcache(folio) && !folio_free_swap(folio)) { folio_unlock(folio); continue; } folio_clear_dirty(folio); folio_unlock(folio); } if (pte_young(ptent) || pte_dirty(ptent)) { clear_young_dirty_ptes(vma, addr, pte, nr, cydp_flags); tlb_remove_tlb_entries(tlb, pte, nr, addr); } folio_mark_lazyfree(folio); } if (nr_swap) add_mm_counter(mm, MM_SWAPENTS, nr_swap); if (start_pte) { arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); } cond_resched(); return 0; } static inline enum page_walk_lock get_walk_lock(enum madvise_lock_mode mode) { switch (mode) { case MADVISE_VMA_READ_LOCK: return PGWALK_VMA_RDLOCK_VERIFY; case MADVISE_MMAP_READ_LOCK: return PGWALK_RDLOCK; default: /* Other modes don't require fixing up the walk_lock */ WARN_ON_ONCE(1); return PGWALK_RDLOCK; } } static int madvise_free_single_vma(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma = madv_behavior->vma; unsigned long start_addr = madv_behavior->range.start; unsigned long end_addr = madv_behavior->range.end; struct mmu_notifier_range range; struct mmu_gather *tlb = madv_behavior->tlb; struct mm_walk_ops walk_ops = { .pmd_entry = madvise_free_pte_range, }; /* MADV_FREE works for only anon vma at the moment */ if (!vma_is_anonymous(vma)) return -EINVAL; range.start = max(vma->vm_start, start_addr); if (range.start >= vma->vm_end) return -EINVAL; range.end = min(vma->vm_end, end_addr); if (range.end <= vma->vm_start) return -EINVAL; mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, range.start, range.end); lru_add_drain(); update_hiwater_rss(mm); mmu_notifier_invalidate_range_start(&range); tlb_start_vma(tlb, vma); walk_ops.walk_lock = get_walk_lock(madv_behavior->lock_mode); walk_page_range_vma(vma, range.start, range.end, &walk_ops, tlb); tlb_end_vma(tlb, vma); mmu_notifier_invalidate_range_end(&range); return 0; } /* * Application no longer needs these pages. If the pages are dirty, * it's OK to just throw them away. The app will be more careful about * data it wants to keep. Be sure to free swap resources too. The * zap_page_range_single call sets things up for shrink_active_list to actually * free these pages later if no one else has touched them in the meantime, * although we could add these pages to a global reuse list for * shrink_active_list to pick up before reclaiming other pages. * * NB: This interface discards data rather than pushes it out to swap, * as some implementations do. This has performance implications for * applications like large transactional databases which want to discard * pages in anonymous maps after committing to backing store the data * that was kept in them. There is no reason to write this data out to * the swap area if the application is discarding it. * * An interface that causes the system to free clean pages and flush * dirty pages is already available as msync(MS_INVALIDATE). */ static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior) { struct madvise_behavior_range *range = &madv_behavior->range; struct zap_details details = { .reclaim_pt = true, .even_cows = true, }; zap_page_range_single_batched( madv_behavior->tlb, madv_behavior->vma, range->start, range->end - range->start, &details); return 0; } static bool madvise_dontneed_free_valid_vma(struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; int behavior = madv_behavior->behavior; struct madvise_behavior_range *range = &madv_behavior->range; if (!is_vm_hugetlb_page(vma)) { unsigned int forbidden = VM_PFNMAP; if (behavior != MADV_DONTNEED_LOCKED) forbidden |= VM_LOCKED; return !(vma->vm_flags & forbidden); } if (behavior != MADV_DONTNEED && behavior != MADV_DONTNEED_LOCKED) return false; if (range->start & ~huge_page_mask(hstate_vma(vma))) return false; /* * Madvise callers expect the length to be rounded up to PAGE_SIZE * boundaries, and may be unaware that this VMA uses huge pages. * Avoid unexpected data loss by rounding down the number of * huge pages freed. */ range->end = ALIGN_DOWN(range->end, huge_page_size(hstate_vma(vma))); return true; } static long madvise_dontneed_free(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; struct madvise_behavior_range *range = &madv_behavior->range; int behavior = madv_behavior->behavior; if (!madvise_dontneed_free_valid_vma(madv_behavior)) return -EINVAL; if (range->start == range->end) return 0; if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) { struct vm_area_struct *vma; mark_mmap_lock_dropped(madv_behavior); mmap_read_lock(mm); madv_behavior->vma = vma = vma_lookup(mm, range->start); if (!vma) return -ENOMEM; /* * Potential end adjustment for hugetlb vma is OK as * the check below keeps end within vma. */ if (!madvise_dontneed_free_valid_vma(madv_behavior)) return -EINVAL; if (range->end > vma->vm_end) { /* * Don't fail if end > vma->vm_end. If the old * vma was split while the mmap_lock was * released the effect of the concurrent * operation may not cause madvise() to * have an undefined result. There may be an * adjacent next vma that we'll walk * next. userfaultfd_remove() will generate an * UFFD_EVENT_REMOVE repetition on the * end-vma->vm_end range, but the manager can * handle a repetition fine. */ range->end = vma->vm_end; } /* * If the memory region between start and end was * originally backed by 4kB pages and then remapped to * be backed by hugepages while mmap_lock was dropped, * the adjustment for hugetlb vma above may have rounded * end down to the start address. */ if (range->start == range->end) return 0; VM_WARN_ON(range->start > range->end); } if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) return madvise_dontneed_single_vma(madv_behavior); else if (behavior == MADV_FREE) return madvise_free_single_vma(madv_behavior); else return -EINVAL; } static long madvise_populate(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; const bool write = madv_behavior->behavior == MADV_POPULATE_WRITE; int locked = 1; unsigned long start = madv_behavior->range.start; unsigned long end = madv_behavior->range.end; long pages; while (start < end) { /* Populate (prefault) page tables readable/writable. */ pages = faultin_page_range(mm, start, end, write, &locked); if (!locked) { mmap_read_lock(mm); locked = 1; } if (pages < 0) { switch (pages) { case -EINTR: return -EINTR; case -EINVAL: /* Incompatible mappings / permissions. */ return -EINVAL; case -EHWPOISON: return -EHWPOISON; case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */ return -EFAULT; default: pr_warn_once("%s: unhandled return value: %ld\n", __func__, pages); fallthrough; case -ENOMEM: /* No VMA or out of memory. */ return -ENOMEM; } } start += pages * PAGE_SIZE; } return 0; } /* * Application wants to free up the pages and associated backing store. * This is effectively punching a hole into the middle of a file. */ static long madvise_remove(struct madvise_behavior *madv_behavior) { loff_t offset; int error; struct file *f; struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma = madv_behavior->vma; unsigned long start = madv_behavior->range.start; unsigned long end = madv_behavior->range.end; mark_mmap_lock_dropped(madv_behavior); if (vma->vm_flags & VM_LOCKED) return -EINVAL; f = vma->vm_file; if (!f || !f->f_mapping || !f->f_mapping->host) { return -EINVAL; } if (!vma_is_shared_maywrite(vma)) return -EACCES; offset = (loff_t)(start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); /* * Filesystem's fallocate may need to take i_rwsem. We need to * explicitly grab a reference because the vma (and hence the * vma's reference to the file) can go away as soon as we drop * mmap_lock. */ get_file(f); if (userfaultfd_remove(vma, start, end)) { /* mmap_lock was not released by userfaultfd_remove() */ mmap_read_unlock(mm); } error = vfs_fallocate(f, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, end - start); fput(f); mmap_read_lock(mm); return error; } static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked) { vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB; /* * A user could lock after setting a guard range but that's fine, as * they'd not be able to fault in. The issue arises when we try to zap * existing locked VMAs. We don't want to do that. */ if (!allow_locked) disallowed |= VM_LOCKED; return !(vma->vm_flags & disallowed); } static bool is_guard_pte_marker(pte_t ptent) { return is_swap_pte(ptent) && is_guard_swp_entry(pte_to_swp_entry(ptent)); } static int guard_install_pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk) { pud_t pudval = pudp_get(pud); /* If huge return >0 so we abort the operation + zap. */ return pud_trans_huge(pudval); } static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) { pmd_t pmdval = pmdp_get(pmd); /* If huge return >0 so we abort the operation + zap. */ return pmd_trans_huge(pmdval); } static int guard_install_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) { pte_t pteval = ptep_get(pte); unsigned long *nr_pages = (unsigned long *)walk->private; /* If there is already a guard page marker, we have nothing to do. */ if (is_guard_pte_marker(pteval)) { (*nr_pages)++; return 0; } /* If populated return >0 so we abort the operation + zap. */ return 1; } static int guard_install_set_pte(unsigned long addr, unsigned long next, pte_t *ptep, struct mm_walk *walk) { unsigned long *nr_pages = (unsigned long *)walk->private; /* Simply install a PTE marker, this causes segfault on access. */ *ptep = make_pte_marker(PTE_MARKER_GUARD); (*nr_pages)++; return 0; } static const struct mm_walk_ops guard_install_walk_ops = { .pud_entry = guard_install_pud_entry, .pmd_entry = guard_install_pmd_entry, .pte_entry = guard_install_pte_entry, .install_pte = guard_install_set_pte, .walk_lock = PGWALK_RDLOCK, }; static long madvise_guard_install(struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; struct madvise_behavior_range *range = &madv_behavior->range; long err; int i; if (!is_valid_guard_vma(vma, /* allow_locked = */false)) return -EINVAL; /* * If we install guard markers, then the range is no longer * empty from a page table perspective and therefore it's * appropriate to have an anon_vma. * * This ensures that on fork, we copy page tables correctly. */ err = anon_vma_prepare(vma); if (err) return err; /* * Optimistically try to install the guard marker pages first. If any * non-guard pages are encountered, give up and zap the range before * trying again. * * We try a few times before giving up and releasing back to userland to * loop around, releasing locks in the process to avoid contention. This * would only happen if there was a great many racing page faults. * * In most cases we should simply install the guard markers immediately * with no zap or looping. */ for (i = 0; i < MAX_MADVISE_GUARD_RETRIES; i++) { unsigned long nr_pages = 0; /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ err = walk_page_range_mm(vma->vm_mm, range->start, range->end, &guard_install_walk_ops, &nr_pages); if (err < 0) return err; if (err == 0) { unsigned long nr_expected_pages = PHYS_PFN(range->end - range->start); VM_WARN_ON(nr_pages != nr_expected_pages); return 0; } /* * OK some of the range have non-guard pages mapped, zap * them. This leaves existing guard pages in place. */ zap_page_range_single(vma, range->start, range->end - range->start, NULL); } /* * We were unable to install the guard pages due to being raced by page * faults. This should not happen ordinarily. We return to userspace and * immediately retry, relieving lock contention. */ return restart_syscall(); } static int guard_remove_pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk) { pud_t pudval = pudp_get(pud); /* If huge, cannot have guard pages present, so no-op - skip. */ if (pud_trans_huge(pudval)) walk->action = ACTION_CONTINUE; return 0; } static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) { pmd_t pmdval = pmdp_get(pmd); /* If huge, cannot have guard pages present, so no-op - skip. */ if (pmd_trans_huge(pmdval)) walk->action = ACTION_CONTINUE; return 0; } static int guard_remove_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) { pte_t ptent = ptep_get(pte); if (is_guard_pte_marker(ptent)) { /* Simply clear the PTE marker. */ pte_clear_not_present_full(walk->mm, addr, pte, false); update_mmu_cache(walk->vma, addr, pte); } return 0; } static const struct mm_walk_ops guard_remove_walk_ops = { .pud_entry = guard_remove_pud_entry, .pmd_entry = guard_remove_pmd_entry, .pte_entry = guard_remove_pte_entry, .walk_lock = PGWALK_RDLOCK, }; static long madvise_guard_remove(struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; struct madvise_behavior_range *range = &madv_behavior->range; /* * We're ok with removing guards in mlock()'d ranges, as this is a * non-destructive action. */ if (!is_valid_guard_vma(vma, /* allow_locked = */true)) return -EINVAL; return walk_page_range_vma(vma, range->start, range->end, &guard_remove_walk_ops, NULL); } #ifdef CONFIG_64BIT /* Does the madvise operation result in discarding of mapped data? */ static bool is_discard(int behavior) { switch (behavior) { case MADV_FREE: case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: case MADV_REMOVE: case MADV_DONTFORK: case MADV_WIPEONFORK: case MADV_GUARD_INSTALL: return true; } return false; } /* * We are restricted from madvise()'ing mseal()'d VMAs only in very particular * circumstances - discarding of data from read-only anonymous SEALED mappings. * * This is because users cannot trivally discard data from these VMAs, and may * only do so via an appropriate madvise() call. */ static bool can_madvise_modify(struct madvise_behavior *madv_behavior) { struct vm_area_struct *vma = madv_behavior->vma; /* If the VMA isn't sealed we're good. */ if (!vma_is_sealed(vma)) return true; /* For a sealed VMA, we only care about discard operations. */ if (!is_discard(madv_behavior->behavior)) return true; /* * We explicitly permit all file-backed mappings, whether MAP_SHARED or * MAP_PRIVATE. * * The latter causes some complications. Because now, one can mmap() * read/write a MAP_PRIVATE mapping, write to it, then mprotect() * read-only, mseal() and a discard will be permitted. * * However, in order to avoid issues with potential use of madvise(..., * MADV_DONTNEED) of mseal()'d .text mappings we, for the time being, * permit this. */ if (!vma_is_anonymous(vma)) return true; /* If the user could write to the mapping anyway, then this is fine. */ if ((vma->vm_flags & VM_WRITE) && arch_vma_access_permitted(vma, /* write= */ true, /* execute= */ false, /* foreign= */ false)) return true; /* Otherwise, we are not permitted to perform this operation. */ return false; } #else static bool can_madvise_modify(struct madvise_behavior *madv_behavior) { return true; } #endif /* * Apply an madvise behavior to a region of a vma. madvise_update_vma * will handle splitting a vm area into separate areas, each area with its own * behavior. */ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior) { int behavior = madv_behavior->behavior; struct vm_area_struct *vma = madv_behavior->vma; vm_flags_t new_flags = vma->vm_flags; struct madvise_behavior_range *range = &madv_behavior->range; int error; if (unlikely(!can_madvise_modify(madv_behavior))) return -EPERM; switch (behavior) { case MADV_REMOVE: return madvise_remove(madv_behavior); case MADV_WILLNEED: return madvise_willneed(madv_behavior); case MADV_COLD: return madvise_cold(madv_behavior); case MADV_PAGEOUT: return madvise_pageout(madv_behavior); case MADV_FREE: case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: return madvise_dontneed_free(madv_behavior); case MADV_COLLAPSE: return madvise_collapse(vma, range->start, range->end, &madv_behavior->lock_dropped); case MADV_GUARD_INSTALL: return madvise_guard_install(madv_behavior); case MADV_GUARD_REMOVE: return madvise_guard_remove(madv_behavior); /* The below behaviours update VMAs via madvise_update_vma(). */ case MADV_NORMAL: new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; break; case MADV_SEQUENTIAL: new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ; break; case MADV_RANDOM: new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ; break; case MADV_DONTFORK: new_flags |= VM_DONTCOPY; break; case MADV_DOFORK: if (new_flags & VM_IO) return -EINVAL; new_flags &= ~VM_DONTCOPY; break; case MADV_WIPEONFORK: /* MADV_WIPEONFORK is only supported on anonymous memory. */ if (vma->vm_file || new_flags & VM_SHARED) return -EINVAL; new_flags |= VM_WIPEONFORK; break; case MADV_KEEPONFORK: if (new_flags & VM_DROPPABLE) return -EINVAL; new_flags &= ~VM_WIPEONFORK; break; case MADV_DONTDUMP: new_flags |= VM_DONTDUMP; break; case MADV_DODUMP: if ((!is_vm_hugetlb_page(vma) && (new_flags & VM_SPECIAL)) || (new_flags & VM_DROPPABLE)) return -EINVAL; new_flags &= ~VM_DONTDUMP; break; case MADV_MERGEABLE: case MADV_UNMERGEABLE: error = ksm_madvise(vma, range->start, range->end, behavior, &new_flags); if (error) goto out; break; case MADV_HUGEPAGE: case MADV_NOHUGEPAGE: error = hugepage_madvise(vma, &new_flags, behavior); if (error) goto out; break; case __MADV_SET_ANON_VMA_NAME: /* Only anonymous mappings can be named */ if (vma->vm_file && !vma_is_anon_shmem(vma)) return -EBADF; break; } /* This is a write operation.*/ VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK); error = madvise_update_vma(new_flags, madv_behavior); out: /* * madvise() returns EAGAIN if kernel resources, such as * slab, are temporarily unavailable. */ if (error == -ENOMEM) error = -EAGAIN; return error; } #ifdef CONFIG_MEMORY_FAILURE /* * Error injection support for memory error handling. */ static int madvise_inject_error(struct madvise_behavior *madv_behavior) { unsigned long size; unsigned long start = madv_behavior->range.start; unsigned long end = madv_behavior->range.end; if (!capable(CAP_SYS_ADMIN)) return -EPERM; for (; start < end; start += size) { unsigned long pfn; struct page *page; int ret; ret = get_user_pages_fast(start, 1, 0, &page); if (ret != 1) return ret; pfn = page_to_pfn(page); /* * When soft offlining hugepages, after migrating the page * we dissolve it, therefore in the second loop "page" will * no longer be a compound page. */ size = page_size(compound_head(page)); if (madv_behavior->behavior == MADV_SOFT_OFFLINE) { pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", pfn, start); ret = soft_offline_page(pfn, MF_COUNT_INCREASED); } else { pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", pfn, start); ret = memory_failure(pfn, MF_ACTION_REQUIRED | MF_COUNT_INCREASED | MF_SW_SIMULATED); if (ret == -EOPNOTSUPP) ret = 0; } if (ret) return ret; } return 0; } static bool is_memory_failure(struct madvise_behavior *madv_behavior) { switch (madv_behavior->behavior) { case MADV_HWPOISON: case MADV_SOFT_OFFLINE: return true; default: return false; } } #else static int madvise_inject_error(struct madvise_behavior *madv_behavior) { return 0; } static bool is_memory_failure(struct madvise_behavior *madv_behavior) { return false; } #endif /* CONFIG_MEMORY_FAILURE */ static bool madvise_behavior_valid(int behavior) { switch (behavior) { case MADV_DOFORK: case MADV_DONTFORK: case MADV_NORMAL: case MADV_SEQUENTIAL: case MADV_RANDOM: case MADV_REMOVE: case MADV_WILLNEED: case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: case MADV_FREE: case MADV_COLD: case MADV_PAGEOUT: case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: #ifdef CONFIG_KSM case MADV_MERGEABLE: case MADV_UNMERGEABLE: #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE case MADV_HUGEPAGE: case MADV_NOHUGEPAGE: case MADV_COLLAPSE: #endif case MADV_DONTDUMP: case MADV_DODUMP: case MADV_WIPEONFORK: case MADV_KEEPONFORK: case MADV_GUARD_INSTALL: case MADV_GUARD_REMOVE: #ifdef CONFIG_MEMORY_FAILURE case MADV_SOFT_OFFLINE: case MADV_HWPOISON: #endif return true; default: return false; } } /* Can we invoke process_madvise() on a remote mm for the specified behavior? */ static bool process_madvise_remote_valid(int behavior) { switch (behavior) { case MADV_COLD: case MADV_PAGEOUT: case MADV_WILLNEED: case MADV_COLLAPSE: return true; default: return false; } } /* * Try to acquire a VMA read lock if possible. * * We only support this lock over a single VMA, which the input range must * span either partially or fully. * * This function always returns with an appropriate lock held. If a VMA read * lock could be acquired, we return true and set madv_behavior state * accordingly. * * If a VMA read lock could not be acquired, we return false and expect caller to * fallback to mmap lock behaviour. */ static bool try_vma_read_lock(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; struct vm_area_struct *vma; vma = lock_vma_under_rcu(mm, madv_behavior->range.start); if (!vma) goto take_mmap_read_lock; /* * Must span only a single VMA; uffd and remote processes are * unsupported. */ if (madv_behavior->range.end > vma->vm_end || current->mm != mm || userfaultfd_armed(vma)) { vma_end_read(vma); goto take_mmap_read_lock; } madv_behavior->vma = vma; return true; take_mmap_read_lock: mmap_read_lock(mm); madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK; return false; } /* * Walk the vmas in range [start,end), and call the madvise_vma_behavior * function on each one. The function will get start and end parameters that * cover the overlap between the current vma and the original range. Any * unmapped regions in the original range will result in this function returning * -ENOMEM while still calling the madvise_vma_behavior function on all of the * existing vmas in the range. Must be called with the mmap_lock held for * reading or writing. */ static int madvise_walk_vmas(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; struct madvise_behavior_range *range = &madv_behavior->range; /* range is updated to span each VMA, so store end of entire range. */ unsigned long last_end = range->end; int unmapped_error = 0; int error; struct vm_area_struct *prev, *vma; /* * If VMA read lock is supported, apply madvise to a single VMA * tentatively, avoiding walking VMAs. */ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK && try_vma_read_lock(madv_behavior)) { error = madvise_vma_behavior(madv_behavior); vma_end_read(madv_behavior->vma); return error; } vma = find_vma_prev(mm, range->start, &prev); if (vma && range->start > vma->vm_start) prev = vma; for (;;) { /* Still start < end. */ if (!vma) return -ENOMEM; /* Here start < (last_end|vma->vm_end). */ if (range->start < vma->vm_start) { /* * This indicates a gap between VMAs in the input * range. This does not cause the operation to abort, * rather we simply return -ENOMEM to indicate that this * has happened, but carry on. */ unmapped_error = -ENOMEM; range->start = vma->vm_start; if (range->start >= last_end) break; } /* Here vma->vm_start <= range->start < (last_end|vma->vm_end) */ range->end = min(vma->vm_end, last_end); /* Here vma->vm_start <= range->start < range->end <= (last_end|vma->vm_end). */ madv_behavior->prev = prev; madv_behavior->vma = vma; error = madvise_vma_behavior(madv_behavior); if (error) return error; if (madv_behavior->lock_dropped) { /* We dropped the mmap lock, we can't ref the VMA. */ prev = NULL; vma = NULL; madv_behavior->lock_dropped = false; } else { vma = madv_behavior->vma; prev = vma; } if (vma && range->end < vma->vm_end) range->end = vma->vm_end; if (range->end >= last_end) break; vma = find_vma(mm, vma ? vma->vm_end : range->end); range->start = range->end; } return unmapped_error; } /* * Any behaviour which results in changes to the vma->vm_flags needs to * take mmap_lock for writing. Others, which simply traverse vmas, need * to only take it for reading. */ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior) { if (is_memory_failure(madv_behavior)) return MADVISE_NO_LOCK; switch (madv_behavior->behavior) { case MADV_REMOVE: case MADV_WILLNEED: case MADV_COLD: case MADV_PAGEOUT: case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE: case MADV_GUARD_INSTALL: case MADV_GUARD_REMOVE: return MADVISE_MMAP_READ_LOCK; case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: case MADV_FREE: return MADVISE_VMA_READ_LOCK; default: return MADVISE_MMAP_WRITE_LOCK; } } static int madvise_lock(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior); switch (lock_mode) { case MADVISE_NO_LOCK: break; case MADVISE_MMAP_WRITE_LOCK: if (mmap_write_lock_killable(mm)) return -EINTR; break; case MADVISE_MMAP_READ_LOCK: mmap_read_lock(mm); break; case MADVISE_VMA_READ_LOCK: /* We will acquire the lock per-VMA in madvise_walk_vmas(). */ break; } madv_behavior->lock_mode = lock_mode; return 0; } static void madvise_unlock(struct madvise_behavior *madv_behavior) { struct mm_struct *mm = madv_behavior->mm; switch (madv_behavior->lock_mode) { case MADVISE_NO_LOCK: return; case MADVISE_MMAP_WRITE_LOCK: mmap_write_unlock(mm); break; case MADVISE_MMAP_READ_LOCK: mmap_read_unlock(mm); break; case MADVISE_VMA_READ_LOCK: /* We will drop the lock per-VMA in madvise_walk_vmas(). */ break; } madv_behavior->lock_mode = MADVISE_NO_LOCK; } static bool madvise_batch_tlb_flush(int behavior) { switch (behavior) { case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: case MADV_FREE: return true; default: return false; } } static void madvise_init_tlb(struct madvise_behavior *madv_behavior) { if (madvise_batch_tlb_flush(madv_behavior->behavior)) tlb_gather_mmu(madv_behavior->tlb, madv_behavior->mm); } static void madvise_finish_tlb(struct madvise_behavior *madv_behavior) { if (madvise_batch_tlb_flush(madv_behavior->behavior)) tlb_finish_mmu(madv_behavior->tlb); } static bool is_valid_madvise(unsigned long start, size_t len_in, int behavior) { size_t len; if (!madvise_behavior_valid(behavior)) return false; if (!PAGE_ALIGNED(start)) return false; len = PAGE_ALIGN(len_in); /* Check to see whether len was rounded up from small -ve to zero */ if (len_in && !len) return false; if (start + len < start) return false; return true; } /* * madvise_should_skip() - Return if the request is invalid or nothing. * @start: Start address of madvise-requested address range. * @len_in: Length of madvise-requested address range. * @behavior: Requested madvise behavor. * @err: Pointer to store an error code from the check. * * If the specified behaviour is invalid or nothing would occur, we skip the * operation. This function returns true in the cases, otherwise false. In * the former case we store an error on @err. */ static bool madvise_should_skip(unsigned long start, size_t len_in, int behavior, int *err) { if (!is_valid_madvise(start, len_in, behavior)) { *err = -EINVAL; return true; } if (start + PAGE_ALIGN(len_in) == start) { *err = 0; return true; } return false; } static bool is_madvise_populate(struct madvise_behavior *madv_behavior) { switch (madv_behavior->behavior) { case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: return true; default: return false; } } /* * untagged_addr_remote() assumes mmap_lock is already held. On * architectures like x86 and RISC-V, tagging is tricky because each * mm may have a different tagging mask. However, we might only hold * the per-VMA lock (currently only local processes are supported), * so untagged_addr is used to avoid the mmap_lock assertion for * local processes. */ static inline unsigned long get_untagged_addr(struct mm_struct *mm, unsigned long start) { return current->mm == mm ? untagged_addr(start) : untagged_addr_remote(mm, start); } static int madvise_do_behavior(unsigned long start, size_t len_in, struct madvise_behavior *madv_behavior) { struct blk_plug plug; int error; struct madvise_behavior_range *range = &madv_behavior->range; if (is_memory_failure(madv_behavior)) { range->start = start; range->end = start + len_in; return madvise_inject_error(madv_behavior); } range->start = get_untagged_addr(madv_behavior->mm, start); range->end = range->start + PAGE_ALIGN(len_in); blk_start_plug(&plug); if (is_madvise_populate(madv_behavior)) error = madvise_populate(madv_behavior); else error = madvise_walk_vmas(madv_behavior); blk_finish_plug(&plug); return error; } /* * The madvise(2) system call. * * Applications can use madvise() to advise the kernel how it should * handle paging I/O in this VM area. The idea is to help the kernel * use appropriate read-ahead and caching techniques. The information * provided is advisory only, and can be safely disregarded by the * kernel without affecting the correct operation of the application. * * behavior values: * MADV_NORMAL - the default behavior is to read clusters. This * results in some read-ahead and read-behind. * MADV_RANDOM - the system should read the minimum amount of data * on any access, since it is unlikely that the appli- * cation will need more than what it asks for. * MADV_SEQUENTIAL - pages in the given range will probably be accessed * once, so they can be aggressively read ahead, and * can be freed soon after they are accessed. * MADV_WILLNEED - the application is notifying the system to read * some pages ahead. * MADV_DONTNEED - the application is finished with the given range, * so the kernel can free resources associated with it. * MADV_FREE - the application marks pages in the given range as lazy free, * where actual purges are postponed until memory pressure happens. * MADV_REMOVE - the application wants to free up the given range of * pages and associated backing store. * MADV_DONTFORK - omit this area from child's address space when forking: * typically, to avoid COWing pages pinned by get_user_pages(). * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. * MADV_WIPEONFORK - present the child process with zero-filled memory in this * range after a fork. * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK * MADV_HWPOISON - trigger memory error handler as if the given memory range * were corrupted by unrecoverable hardware memory failure. * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. * MADV_MERGEABLE - the application recommends that KSM try to merge pages in * this area with pages of identical content from other such areas. * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. * MADV_HUGEPAGE - the application wants to back the given range by transparent * huge pages in the future. Existing pages might be coalesced and * new pages might be allocated as THP. * MADV_NOHUGEPAGE - mark the given range as not worth being backed by * transparent huge pages so the existing pages will not be * coalesced into THP and new pages will not be allocated as THP. * MADV_COLLAPSE - synchronously coalesce pages into new THP. * MADV_DONTDUMP - the application wants to prevent pages in the given range * from being included in its core dump. * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. * MADV_COLD - the application is not expected to use this memory soon, * deactivate pages in this range so that they can be reclaimed * easily if memory pressure happens. * MADV_PAGEOUT - the application is not expected to use this memory soon, * page out the pages in this range immediately. * MADV_POPULATE_READ - populate (prefault) page tables readable by * triggering read faults if required * MADV_POPULATE_WRITE - populate (prefault) page tables writable by * triggering write faults if required * * return values: * zero - success * -EINVAL - start + len < 0, start is not page-aligned, * "behavior" is not a valid value, or application * is attempting to release locked or shared pages, * or the specified address range includes file, Huge TLB, * MAP_SHARED or VMPFNMAP range. * -ENOMEM - addresses in the specified range are not currently * mapped, or are outside the AS of the process. * -EIO - an I/O error occurred while paging in data. * -EBADF - map exists, but area maps something that isn't a file. * -EAGAIN - a kernel resource was temporarily unavailable. * -EPERM - memory is sealed. */ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { int error; struct mmu_gather tlb; struct madvise_behavior madv_behavior = { .mm = mm, .behavior = behavior, .tlb = &tlb, }; if (madvise_should_skip(start, len_in, behavior, &error)) return error; error = madvise_lock(&madv_behavior); if (error) return error; madvise_init_tlb(&madv_behavior); error = madvise_do_behavior(start, len_in, &madv_behavior); madvise_finish_tlb(&madv_behavior); madvise_unlock(&madv_behavior); return error; } SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) { return do_madvise(current->mm, start, len_in, behavior); } /* Perform an madvise operation over a vector of addresses and lengths. */ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter, int behavior) { ssize_t ret = 0; size_t total_len; struct mmu_gather tlb; struct madvise_behavior madv_behavior = { .mm = mm, .behavior = behavior, .tlb = &tlb, }; total_len = iov_iter_count(iter); ret = madvise_lock(&madv_behavior); if (ret) return ret; madvise_init_tlb(&madv_behavior); while (iov_iter_count(iter)) { unsigned long start = (unsigned long)iter_iov_addr(iter); size_t len_in = iter_iov_len(iter); int error; if (madvise_should_skip(start, len_in, behavior, &error)) ret = error; else ret = madvise_do_behavior(start, len_in, &madv_behavior); /* * An madvise operation is attempting to restart the syscall, * but we cannot proceed as it would not be correct to repeat * the operation in aggregate, and would be surprising to the * user. * * We drop and reacquire locks so it is safe to just loop and * try again. We check for fatal signals in case we need exit * early anyway. */ if (ret == -ERESTARTNOINTR) { if (fatal_signal_pending(current)) { ret = -EINTR; break; } /* Drop and reacquire lock to unwind race. */ madvise_finish_tlb(&madv_behavior); madvise_unlock(&madv_behavior); ret = madvise_lock(&madv_behavior); if (ret) goto out; madvise_init_tlb(&madv_behavior); continue; } if (ret < 0) break; iov_iter_advance(iter, iter_iov_len(iter)); } madvise_finish_tlb(&madv_behavior); madvise_unlock(&madv_behavior); out: ret = (total_len - iov_iter_count(iter)) ? : ret; return ret; } SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, size_t, vlen, int, behavior, unsigned int, flags) { ssize_t ret; struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov = iovstack; struct iov_iter iter; struct task_struct *task; struct mm_struct *mm; unsigned int f_flags; if (flags != 0) { ret = -EINVAL; goto out; } ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter); if (ret < 0) goto out; task = pidfd_get_task(pidfd, &f_flags); if (IS_ERR(task)) { ret = PTR_ERR(task); goto free_iov; } /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */ mm = mm_access(task, PTRACE_MODE_READ_FSCREDS); if (IS_ERR(mm)) { ret = PTR_ERR(mm); goto release_task; } /* * We need only perform this check if we are attempting to manipulate a * remote process's address space. */ if (mm != current->mm && !process_madvise_remote_valid(behavior)) { ret = -EINVAL; goto release_mm; } /* * Require CAP_SYS_NICE for influencing process performance. Note that * only non-destructive hints are currently supported for remote * processes. */ if (mm != current->mm && !capable(CAP_SYS_NICE)) { ret = -EPERM; goto release_mm; } ret = vector_madvise(mm, &iter, behavior); release_mm: mmput(mm); release_task: put_task_struct(task); free_iov: kfree(iov); out: return ret; } #ifdef CONFIG_ANON_VMA_NAME #define ANON_VMA_NAME_MAX_LEN 80 #define ANON_VMA_NAME_INVALID_CHARS "\\`$[]" static inline bool is_valid_name_char(char ch) { /* printable ascii characters, excluding ANON_VMA_NAME_INVALID_CHARS */ return ch > 0x1f && ch < 0x7f && !strchr(ANON_VMA_NAME_INVALID_CHARS, ch); } static int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, unsigned long len_in, struct anon_vma_name *anon_name) { unsigned long end; unsigned long len; int error; struct madvise_behavior madv_behavior = { .mm = mm, .behavior = __MADV_SET_ANON_VMA_NAME, .anon_name = anon_name, }; if (start & ~PAGE_MASK) return -EINVAL; len = (len_in + ~PAGE_MASK) & PAGE_MASK; /* Check to see whether len was rounded up from small -ve to zero */ if (len_in && !len) return -EINVAL; end = start + len; if (end < start) return -EINVAL; if (end == start) return 0; madv_behavior.range.start = start; madv_behavior.range.end = end; error = madvise_lock(&madv_behavior); if (error) return error; error = madvise_walk_vmas(&madv_behavior); madvise_unlock(&madv_behavior); return error; } int set_anon_vma_name(unsigned long addr, unsigned long size, const char __user *uname) { struct anon_vma_name *anon_name = NULL; struct mm_struct *mm = current->mm; int error; if (uname) { char *name, *pch; name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN); if (IS_ERR(name)) return PTR_ERR(name); for (pch = name; *pch != '\0'; pch++) { if (!is_valid_name_char(*pch)) { kfree(name); return -EINVAL; } } /* anon_vma has its own copy */ anon_name = anon_vma_name_alloc(name); kfree(name); if (!anon_name) return -ENOMEM; } error = madvise_set_anon_name(mm, addr, size, anon_name); anon_vma_name_put(anon_name); return error; } #endif |
| 5 5 5 5 5 5 13 13 13 13 4 4 4 4 4 1 3 4 1 4 37 38 37 37 38 38 14 39 39 1 39 43 41 40 29 43 139 139 9 325 324 148 147 16 131 148 148 141 141 139 8 8 140 138 139 327 307 5 5 3 2 2 13 11 13 23 6 5 4 21 21 21 23 20 15 14 13 13 13 1 1 1 1 1 1 2 4 4 4 4 3 4 2 2 2 2 2 2 2 27 4 1 2 1 1 1 1 1 1 7 3 6 4 4 2 2 27 4 8 62 61 59 23 43 40 12 38 59 4 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 | // SPDX-License-Identifier: GPL-2.0-only /* * linux/kernel/ptrace.c * * (C) Copyright 1999 Linus Torvalds * * Common interfaces for "ptrace()" which we do not want * to continually duplicate across every architecture. */ #include <linux/capability.h> #include <linux/export.h> #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/sched/coredump.h> #include <linux/sched/task.h> #include <linux/errno.h> #include <linux/mm.h> #include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/ptrace.h> #include <linux/security.h> #include <linux/signal.h> #include <linux/uio.h> #include <linux/audit.h> #include <linux/pid_namespace.h> #include <linux/syscalls.h> #include <linux/uaccess.h> #include <linux/regset.h> #include <linux/hw_breakpoint.h> #include <linux/cn_proc.h> #include <linux/compat.h> #include <linux/sched/signal.h> #include <linux/minmax.h> #include <linux/syscall_user_dispatch.h> #include <asm/syscall.h> /* for syscall_get_* */ /* * Access another process' address space via ptrace. * Source/target buffer must be kernel space, * Do not walk the page table directly, use get_user_pages */ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, unsigned int gup_flags) { struct mm_struct *mm; int ret; mm = get_task_mm(tsk); if (!mm) return 0; if (!tsk->ptrace || (current != tsk->parent) || ((get_dumpable(mm) != SUID_DUMP_USER) && !ptracer_capable(tsk, mm->user_ns))) { mmput(mm); return 0; } ret = access_remote_vm(mm, addr, buf, len, gup_flags); mmput(mm); return ret; } void __ptrace_link(struct task_struct *child, struct task_struct *new_parent, const struct cred *ptracer_cred) { BUG_ON(!list_empty(&child->ptrace_entry)); list_add(&child->ptrace_entry, &new_parent->ptraced); child->parent = new_parent; child->ptracer_cred = get_cred(ptracer_cred); } /* * ptrace a task: make the debugger its new parent and * move it to the ptrace list. * * Must be called with the tasklist lock write-held. */ static void ptrace_link(struct task_struct *child, struct task_struct *new_parent) { __ptrace_link(child, new_parent, current_cred()); } /** * __ptrace_unlink - unlink ptracee and restore its execution state * @child: ptracee to be unlinked * * Remove @child from the ptrace list, move it back to the original parent, * and restore the execution state so that it conforms to the group stop * state. * * Unlinking can happen via two paths - explicit PTRACE_DETACH or ptracer * exiting. For PTRACE_DETACH, unless the ptracee has been killed between * ptrace_check_attach() and here, it's guaranteed to be in TASK_TRACED. * If the ptracer is exiting, the ptracee can be in any state. * * After detach, the ptracee should be in a state which conforms to the * group stop. If the group is stopped or in the process of stopping, the * ptracee should be put into TASK_STOPPED; otherwise, it should be woken * up from TASK_TRACED. * * If the ptracee is in TASK_TRACED and needs to be moved to TASK_STOPPED, * it goes through TRACED -> RUNNING -> STOPPED transition which is similar * to but in the opposite direction of what happens while attaching to a * stopped task. However, in this direction, the intermediate RUNNING * state is not hidden even from the current ptracer and if it immediately * re-attaches and performs a WNOHANG wait(2), it may fail. * * CONTEXT: * write_lock_irq(tasklist_lock) */ void __ptrace_unlink(struct task_struct *child) { const struct cred *old_cred; BUG_ON(!child->ptrace); clear_task_syscall_work(child, SYSCALL_TRACE); #if defined(CONFIG_GENERIC_ENTRY) || defined(TIF_SYSCALL_EMU) clear_task_syscall_work(child, SYSCALL_EMU); #endif child->parent = child->real_parent; list_del_init(&child->ptrace_entry); old_cred = child->ptracer_cred; child->ptracer_cred = NULL; put_cred(old_cred); spin_lock(&child->sighand->siglock); child->ptrace = 0; /* * Clear all pending traps and TRAPPING. TRAPPING should be * cleared regardless of JOBCTL_STOP_PENDING. Do it explicitly. */ task_clear_jobctl_pending(child, JOBCTL_TRAP_MASK); task_clear_jobctl_trapping(child); /* * Reinstate JOBCTL_STOP_PENDING if group stop is in effect and * @child isn't dead. */ if (!(child->flags & PF_EXITING) && (child->signal->flags & SIGNAL_STOP_STOPPED || child->signal->group_stop_count)) child->jobctl |= JOBCTL_STOP_PENDING; /* * If transition to TASK_STOPPED is pending or in TASK_TRACED, kick * @child in the butt. Note that @resume should be used iff @child * is in TASK_TRACED; otherwise, we might unduly disrupt * TASK_KILLABLE sleeps. */ if (child->jobctl & JOBCTL_STOP_PENDING || task_is_traced(child)) ptrace_signal_wake_up(child, true); spin_unlock(&child->sighand->siglock); } static bool looks_like_a_spurious_pid(struct task_struct *task) { if (task->exit_code != ((PTRACE_EVENT_EXEC << 8) | SIGTRAP)) return false; if (task_pid_vnr(task) == task->ptrace_message) return false; /* * The tracee changed its pid but the PTRACE_EVENT_EXEC event * was not wait()'ed, most probably debugger targets the old * leader which was destroyed in de_thread(). */ return true; } /* * Ensure that nothing can wake it up, even SIGKILL * * A task is switched to this state while a ptrace operation is in progress; * such that the ptrace operation is uninterruptible. */ static bool ptrace_freeze_traced(struct task_struct *task) { bool ret = false; /* Lockless, nobody but us can set this flag */ if (task->jobctl & JOBCTL_LISTENING) return ret; spin_lock_irq(&task->sighand->siglock); if (task_is_traced(task) && !looks_like_a_spurious_pid(task) && !__fatal_signal_pending(task)) { task->jobctl |= JOBCTL_PTRACE_FROZEN; ret = true; } spin_unlock_irq(&task->sighand->siglock); return ret; } static void ptrace_unfreeze_traced(struct task_struct *task) { unsigned long flags; /* * The child may be awake and may have cleared * JOBCTL_PTRACE_FROZEN (see ptrace_resume). The child will * not set JOBCTL_PTRACE_FROZEN or enter __TASK_TRACED anew. */ if (lock_task_sighand(task, &flags)) { task->jobctl &= ~JOBCTL_PTRACE_FROZEN; if (__fatal_signal_pending(task)) { task->jobctl &= ~JOBCTL_TRACED; wake_up_state(task, __TASK_TRACED); } unlock_task_sighand(task, &flags); } } /** * ptrace_check_attach - check whether ptracee is ready for ptrace operation * @child: ptracee to check for * @ignore_state: don't check whether @child is currently %TASK_TRACED * * Check whether @child is being ptraced by %current and ready for further * ptrace operations. If @ignore_state is %false, @child also should be in * %TASK_TRACED state and on return the child is guaranteed to be traced * and not executing. If @ignore_state is %true, @child can be in any * state. * * CONTEXT: * Grabs and releases tasklist_lock and @child->sighand->siglock. * * RETURNS: * 0 on success, -ESRCH if %child is not ready. */ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) { int ret = -ESRCH; /* * We take the read lock around doing both checks to close a * possible race where someone else was tracing our child and * detached between these two checks. After this locked check, * we are sure that this is our traced child and that can only * be changed by us so it's not changing right after this. */ read_lock(&tasklist_lock); if (child->ptrace && child->parent == current) { /* * child->sighand can't be NULL, release_task() * does ptrace_unlink() before __exit_signal(). */ if (ignore_state || ptrace_freeze_traced(child)) ret = 0; } read_unlock(&tasklist_lock); if (!ret && !ignore_state && WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED|TASK_FROZEN))) ret = -ESRCH; return ret; } static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode) { if (mode & PTRACE_MODE_NOAUDIT) return ns_capable_noaudit(ns, CAP_SYS_PTRACE); return ns_capable(ns, CAP_SYS_PTRACE); } /* Returns 0 on success, -errno on denial. */ static int __ptrace_may_access(struct task_struct *task, unsigned int mode) { const struct cred *cred = current_cred(), *tcred; struct mm_struct *mm; kuid_t caller_uid; kgid_t caller_gid; if (!(mode & PTRACE_MODE_FSCREDS) == !(mode & PTRACE_MODE_REALCREDS)) { WARN(1, "denying ptrace access check without PTRACE_MODE_*CREDS\n"); return -EPERM; } /* May we inspect the given task? * This check is used both for attaching with ptrace * and for allowing access to sensitive information in /proc. * * ptrace_attach denies several cases that /proc allows * because setting up the necessary parent/child relationship * or halting the specified task is impossible. */ /* Don't let security modules deny introspection */ if (same_thread_group(task, current)) return 0; rcu_read_lock(); if (mode & PTRACE_MODE_FSCREDS) { caller_uid = cred->fsuid; caller_gid = cred->fsgid; } else { /* * Using the euid would make more sense here, but something * in userland might rely on the old behavior, and this * shouldn't be a security problem since * PTRACE_MODE_REALCREDS implies that the caller explicitly * used a syscall that requests access to another process * (and not a filesystem syscall to procfs). */ caller_uid = cred->uid; caller_gid = cred->gid; } tcred = __task_cred(task); if (uid_eq(caller_uid, tcred->euid) && uid_eq(caller_uid, tcred->suid) && uid_eq(caller_uid, tcred->uid) && gid_eq(caller_gid, tcred->egid) && gid_eq(caller_gid, tcred->sgid) && gid_eq(caller_gid, tcred->gid)) goto ok; if (ptrace_has_cap(tcred->user_ns, mode)) goto ok; rcu_read_unlock(); return -EPERM; ok: rcu_read_unlock(); /* * If a task drops privileges and becomes nondumpable (through a syscall * like setresuid()) while we are trying to access it, we must ensure * that the dumpability is read after the credentials; otherwise, * we may be able to attach to a task that we shouldn't be able to * attach to (as if the task had dropped privileges without becoming * nondumpable). * Pairs with a write barrier in commit_creds(). */ smp_rmb(); mm = task->mm; if (mm && ((get_dumpable(mm) != SUID_DUMP_USER) && !ptrace_has_cap(mm->user_ns, mode))) return -EPERM; return security_ptrace_access_check(task, mode); } bool ptrace_may_access(struct task_struct *task, unsigned int mode) { int err; task_lock(task); err = __ptrace_may_access(task, mode); task_unlock(task); return !err; } static int check_ptrace_options(unsigned long data) { if (data & ~(unsigned long)PTRACE_O_MASK) return -EINVAL; if (unlikely(data & PTRACE_O_SUSPEND_SECCOMP)) { if (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || !IS_ENABLED(CONFIG_SECCOMP)) return -EINVAL; if (!capable(CAP_SYS_ADMIN)) return -EPERM; if (seccomp_mode(¤t->seccomp) != SECCOMP_MODE_DISABLED || current->ptrace & PT_SUSPEND_SECCOMP) return -EPERM; } return 0; } static inline void ptrace_set_stopped(struct task_struct *task, bool seize) { guard(spinlock)(&task->sighand->siglock); /* SEIZE doesn't trap tracee on attach */ if (!seize) send_signal_locked(SIGSTOP, SEND_SIG_PRIV, task, PIDTYPE_PID); /* * If the task is already STOPPED, set JOBCTL_TRAP_STOP and * TRAPPING, and kick it so that it transits to TRACED. TRAPPING * will be cleared if the child completes the transition or any * event which clears the group stop states happens. We'll wait * for the transition to complete before returning from this * function. * * This hides STOPPED -> RUNNING -> TRACED transition from the * attaching thread but a different thread in the same group can * still observe the transient RUNNING state. IOW, if another * thread's WNOHANG wait(2) on the stopped tracee races against * ATTACH, the wait(2) may fail due to the transient RUNNING. * * The following task_is_stopped() test is safe as both transitions * in and out of STOPPED are protected by siglock. */ if (task_is_stopped(task) && task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING)) { task->jobctl &= ~JOBCTL_STOPPED; signal_wake_up_state(task, __TASK_STOPPED); } } static int ptrace_attach(struct task_struct *task, long request, unsigned long addr, unsigned long flags) { bool seize = (request == PTRACE_SEIZE); int retval; if (seize) { if (addr != 0) return -EIO; /* * This duplicates the check in check_ptrace_options() because * ptrace_attach() and ptrace_setoptions() have historically * used different error codes for unknown ptrace options. */ if (flags & ~(unsigned long)PTRACE_O_MASK) return -EIO; retval = check_ptrace_options(flags); if (retval) return retval; flags = PT_PTRACED | PT_SEIZED | (flags << PT_OPT_FLAG_SHIFT); } else { flags = PT_PTRACED; } audit_ptrace(task); if (unlikely(task->flags & PF_KTHREAD)) return -EPERM; if (same_thread_group(task, current)) return -EPERM; /* * Protect exec's credential calculations against our interference; * SUID, SGID and LSM creds get determined differently * under ptrace. */ scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR, &task->signal->cred_guard_mutex) { scoped_guard (task_lock, task) { retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS); if (retval) return retval; } scoped_guard (write_lock_irq, &tasklist_lock) { if (unlikely(task->exit_state)) return -EPERM; if (task->ptrace) return -EPERM; task->ptrace = flags; ptrace_link(task, current); ptrace_set_stopped(task, seize); } } /* * We do not bother to change retval or clear JOBCTL_TRAPPING * if wait_on_bit() was interrupted by SIGKILL. The tracer will * not return to user-mode, it will exit and clear this bit in * __ptrace_unlink() if it wasn't already cleared by the tracee; * and until then nobody can ptrace this task. */ wait_on_bit(&task->jobctl, JOBCTL_TRAPPING_BIT, TASK_KILLABLE); proc_ptrace_connector(task, PTRACE_ATTACH); return 0; } /** * ptrace_traceme -- helper for PTRACE_TRACEME * * Performs checks and sets PT_PTRACED. * Should be used by all ptrace implementations for PTRACE_TRACEME. */ static int ptrace_traceme(void) { int ret = -EPERM; write_lock_irq(&tasklist_lock); /* Are we already being traced? */ if (!current->ptrace) { ret = security_ptrace_traceme(current->parent); /* * Check PF_EXITING to ensure ->real_parent has not passed * exit_ptrace(). Otherwise we don't report the error but * pretend ->real_parent untraces us right after return. */ if (!ret && !(current->real_parent->flags & PF_EXITING)) { current->ptrace = PT_PTRACED; ptrace_link(current, current->real_parent); } } write_unlock_irq(&tasklist_lock); return ret; } /* * Called with irqs disabled, returns true if childs should reap themselves. */ static int ignoring_children(struct sighand_struct *sigh) { int ret; spin_lock(&sigh->siglock); ret = (sigh->action[SIGCHLD-1].sa.sa_handler == SIG_IGN) || (sigh->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT); spin_unlock(&sigh->siglock); return ret; } /* * Called with tasklist_lock held for writing. * Unlink a traced task, and clean it up if it was a traced zombie. * Return true if it needs to be reaped with release_task(). * (We can't call release_task() here because we already hold tasklist_lock.) * * If it's a zombie, our attachedness prevented normal parent notification * or self-reaping. Do notification now if it would have happened earlier. * If it should reap itself, return true. * * If it's our own child, there is no notification to do. But if our normal * children self-reap, then this child was prevented by ptrace and we must * reap it now, in that case we must also wake up sub-threads sleeping in * do_wait(). */ static bool __ptrace_detach(struct task_struct *tracer, struct task_struct *p) { bool dead; __ptrace_unlink(p); if (p->exit_state != EXIT_ZOMBIE) return false; dead = !thread_group_leader(p); if (!dead && thread_group_empty(p)) { if (!same_thread_group(p->real_parent, tracer)) dead = do_notify_parent(p, p->exit_signal); else if (ignoring_children(tracer->sighand)) { __wake_up_parent(p, tracer); dead = true; } } /* Mark it as in the process of being reaped. */ if (dead) p->exit_state = EXIT_DEAD; return dead; } static int ptrace_detach(struct task_struct *child, unsigned int data) { if (!valid_signal(data)) return -EIO; /* Architecture-specific hardware disable .. */ ptrace_disable(child); write_lock_irq(&tasklist_lock); /* * We rely on ptrace_freeze_traced(). It can't be killed and * untraced by another thread, it can't be a zombie. */ WARN_ON(!child->ptrace || child->exit_state); /* * tasklist_lock avoids the race with wait_task_stopped(), see * the comment in ptrace_resume(). */ child->exit_code = data; __ptrace_detach(current, child); write_unlock_irq(&tasklist_lock); proc_ptrace_connector(child, PTRACE_DETACH); return 0; } /* * Detach all tasks we were using ptrace on. Called with tasklist held * for writing. */ void exit_ptrace(struct task_struct *tracer, struct list_head *dead) { struct task_struct *p, *n; list_for_each_entry_safe(p, n, &tracer->ptraced, ptrace_entry) { if (unlikely(p->ptrace & PT_EXITKILL)) send_sig_info(SIGKILL, SEND_SIG_PRIV, p); if (__ptrace_detach(tracer, p)) list_add(&p->ptrace_entry, dead); } } int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len) { int copied = 0; while (len > 0) { char buf[128]; int this_len, retval; this_len = (len > sizeof(buf)) ? sizeof(buf) : len; retval = ptrace_access_vm(tsk, src, buf, this_len, FOLL_FORCE); if (!retval) { if (copied) break; return -EIO; } if (copy_to_user(dst, buf, retval)) return -EFAULT; copied += retval; src += retval; dst += retval; len -= retval; } return copied; } int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned long dst, int len) { int copied = 0; while (len > 0) { char buf[128]; int this_len, retval; this_len = (len > sizeof(buf)) ? sizeof(buf) : len; if (copy_from_user(buf, src, this_len)) return -EFAULT; retval = ptrace_access_vm(tsk, dst, buf, this_len, FOLL_FORCE | FOLL_WRITE); if (!retval) { if (copied) break; return -EIO; } copied += retval; src += retval; dst += retval; len -= retval; } return copied; } static int ptrace_setoptions(struct task_struct *child, unsigned long data) { unsigned flags; int ret; ret = check_ptrace_options(data); if (ret) return ret; /* Avoid intermediate state when all opts are cleared */ flags = child->ptrace; flags &= ~(PTRACE_O_MASK << PT_OPT_FLAG_SHIFT); flags |= (data << PT_OPT_FLAG_SHIFT); child->ptrace = flags; return 0; } static int ptrace_getsiginfo(struct task_struct *child, kernel_siginfo_t *info) { unsigned long flags; int error = -ESRCH; if (lock_task_sighand(child, &flags)) { error = -EINVAL; if (likely(child->last_siginfo != NULL)) { copy_siginfo(info, child->last_siginfo); error = 0; } unlock_task_sighand(child, &flags); } return error; } static int ptrace_setsiginfo(struct task_struct *child, const kernel_siginfo_t *info) { unsigned long flags; int error = -ESRCH; if (lock_task_sighand(child, &flags)) { error = -EINVAL; if (likely(child->last_siginfo != NULL)) { copy_siginfo(child->last_siginfo, info); error = 0; } unlock_task_sighand(child, &flags); } return error; } static int ptrace_peek_siginfo(struct task_struct *child, unsigned long addr, unsigned long data) { struct ptrace_peeksiginfo_args arg; struct sigpending *pending; struct sigqueue *q; int ret, i; ret = copy_from_user(&arg, (void __user *) addr, sizeof(struct ptrace_peeksiginfo_args)); if (ret) return -EFAULT; if (arg.flags & ~PTRACE_PEEKSIGINFO_SHARED) return -EINVAL; /* unknown flags */ if (arg.nr < 0) return -EINVAL; /* Ensure arg.off fits in an unsigned long */ if (arg.off > ULONG_MAX) return 0; if (arg.flags & PTRACE_PEEKSIGINFO_SHARED) pending = &child->signal->shared_pending; else pending = &child->pending; for (i = 0; i < arg.nr; ) { kernel_siginfo_t info; unsigned long off = arg.off + i; bool found = false; spin_lock_irq(&child->sighand->siglock); list_for_each_entry(q, &pending->list, list) { if (!off--) { found = true; copy_siginfo(&info, &q->info); break; } } spin_unlock_irq(&child->sighand->siglock); if (!found) /* beyond the end of the list */ break; #ifdef CONFIG_COMPAT if (unlikely(in_compat_syscall())) { compat_siginfo_t __user *uinfo = compat_ptr(data); if (copy_siginfo_to_user32(uinfo, &info)) { ret = -EFAULT; break; } } else #endif { siginfo_t __user *uinfo = (siginfo_t __user *) data; if (copy_siginfo_to_user(uinfo, &info)) { ret = -EFAULT; break; } } data += sizeof(siginfo_t); i++; if (signal_pending(current)) break; cond_resched(); } if (i > 0) return i; return ret; } #ifdef CONFIG_RSEQ static long ptrace_get_rseq_configuration(struct task_struct *task, unsigned long size, void __user *data) { struct ptrace_rseq_configuration conf = { .rseq_abi_pointer = (u64)(uintptr_t)task->rseq, .rseq_abi_size = task->rseq_len, .signature = task->rseq_sig, .flags = 0, }; size = min_t(unsigned long, size, sizeof(conf)); if (copy_to_user(data, &conf, size)) return -EFAULT; return sizeof(conf); } #endif #define is_singlestep(request) ((request) == PTRACE_SINGLESTEP) #ifdef PTRACE_SINGLEBLOCK #define is_singleblock(request) ((request) == PTRACE_SINGLEBLOCK) #else #define is_singleblock(request) 0 #endif #ifdef PTRACE_SYSEMU #define is_sysemu_singlestep(request) ((request) == PTRACE_SYSEMU_SINGLESTEP) #else #define is_sysemu_singlestep(request) 0 #endif static int ptrace_resume(struct task_struct *child, long request, unsigned long data) { if (!valid_signal(data)) return -EIO; if (request == PTRACE_SYSCALL) set_task_syscall_work(child, SYSCALL_TRACE); else clear_task_syscall_work(child, SYSCALL_TRACE); #if defined(CONFIG_GENERIC_ENTRY) || defined(TIF_SYSCALL_EMU) if (request == PTRACE_SYSEMU || request == PTRACE_SYSEMU_SINGLESTEP) set_task_syscall_work(child, SYSCALL_EMU); else clear_task_syscall_work(child, SYSCALL_EMU); #endif if (is_singleblock(request)) { if (unlikely(!arch_has_block_step())) return -EIO; user_enable_block_step(child); } else if (is_singlestep(request) || is_sysemu_singlestep(request)) { if (unlikely(!arch_has_single_step())) return -EIO; user_enable_single_step(child); } else { user_disable_single_step(child); } /* * Change ->exit_code and ->state under siglock to avoid the race * with wait_task_stopped() in between; a non-zero ->exit_code will * wrongly look like another report from tracee. * * Note that we need siglock even if ->exit_code == data and/or this * status was not reported yet, the new status must not be cleared by * wait_task_stopped() after resume. */ spin_lock_irq(&child->sighand->siglock); child->exit_code = data; child->jobctl &= ~JOBCTL_TRACED; wake_up_state(child, __TASK_TRACED); spin_unlock_irq(&child->sighand->siglock); return 0; } #ifdef CONFIG_HAVE_ARCH_TRACEHOOK static const struct user_regset * find_regset(const struct user_regset_view *view, unsigned int type) { const struct user_regset *regset; int n; for (n = 0; n < view->n; ++n) { regset = view->regsets + n; if (regset->core_note_type == type) return regset; } return NULL; } static int ptrace_regset(struct task_struct *task, int req, unsigned int type, struct iovec *kiov) { const struct user_regset_view *view = task_user_regset_view(task); const struct user_regset *regset = find_regset(view, type); int regset_no; if (!regset || (kiov->iov_len % regset->size) != 0) return -EINVAL; regset_no = regset - view->regsets; kiov->iov_len = min(kiov->iov_len, (__kernel_size_t) (regset->n * regset->size)); if (req == PTRACE_GETREGSET) return copy_regset_to_user(task, view, regset_no, 0, kiov->iov_len, kiov->iov_base); else return copy_regset_from_user(task, view, regset_no, 0, kiov->iov_len, kiov->iov_base); } /* * This is declared in linux/regset.h and defined in machine-dependent * code. We put the export here, near the primary machine-neutral use, * to ensure no machine forgets it. */ EXPORT_SYMBOL_GPL(task_user_regset_view); static unsigned long ptrace_get_syscall_info_entry(struct task_struct *child, struct pt_regs *regs, struct ptrace_syscall_info *info) { unsigned long args[ARRAY_SIZE(info->entry.args)]; int i; info->entry.nr = syscall_get_nr(child, regs); syscall_get_arguments(child, regs, args); for (i = 0; i < ARRAY_SIZE(args); i++) info->entry.args[i] = args[i]; /* args is the last field in struct ptrace_syscall_info.entry */ return offsetofend(struct ptrace_syscall_info, entry.args); } static unsigned long ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs, struct ptrace_syscall_info *info) { /* * As struct ptrace_syscall_info.entry is currently a subset * of struct ptrace_syscall_info.seccomp, it makes sense to * initialize that subset using ptrace_get_syscall_info_entry(). * This can be reconsidered in the future if these structures * diverge significantly enough. */ ptrace_get_syscall_info_entry(child, regs, info); info->seccomp.ret_data = child->ptrace_message; /* * ret_data is the last non-reserved field * in struct ptrace_syscall_info.seccomp */ return offsetofend(struct ptrace_syscall_info, seccomp.ret_data); } static unsigned long ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs, struct ptrace_syscall_info *info) { info->exit.rval = syscall_get_error(child, regs); info->exit.is_error = !!info->exit.rval; if (!info->exit.is_error) info->exit.rval = syscall_get_return_value(child, regs); /* is_error is the last field in struct ptrace_syscall_info.exit */ return offsetofend(struct ptrace_syscall_info, exit.is_error); } static int ptrace_get_syscall_info_op(struct task_struct *child) { /* * This does not need lock_task_sighand() to access * child->last_siginfo because ptrace_freeze_traced() * called earlier by ptrace_check_attach() ensures that * the tracee cannot go away and clear its last_siginfo. */ switch (child->last_siginfo ? child->last_siginfo->si_code : 0) { case SIGTRAP | 0x80: switch (child->ptrace_message) { case PTRACE_EVENTMSG_SYSCALL_ENTRY: return PTRACE_SYSCALL_INFO_ENTRY; case PTRACE_EVENTMSG_SYSCALL_EXIT: return PTRACE_SYSCALL_INFO_EXIT; default: return PTRACE_SYSCALL_INFO_NONE; } case SIGTRAP | (PTRACE_EVENT_SECCOMP << 8): return PTRACE_SYSCALL_INFO_SECCOMP; default: return PTRACE_SYSCALL_INFO_NONE; } } static int ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size, void __user *datavp) { struct pt_regs *regs = task_pt_regs(child); struct ptrace_syscall_info info = { .op = ptrace_get_syscall_info_op(child), .arch = syscall_get_arch(child), .instruction_pointer = instruction_pointer(regs), .stack_pointer = user_stack_pointer(regs), }; unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry); unsigned long write_size; switch (info.op) { case PTRACE_SYSCALL_INFO_ENTRY: actual_size = ptrace_get_syscall_info_entry(child, regs, &info); break; case PTRACE_SYSCALL_INFO_EXIT: actual_size = ptrace_get_syscall_info_exit(child, regs, &info); break; case PTRACE_SYSCALL_INFO_SECCOMP: actual_size = ptrace_get_syscall_info_seccomp(child, regs, &info); break; } write_size = min(actual_size, user_size); return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size; } static int ptrace_set_syscall_info_entry(struct task_struct *child, struct pt_regs *regs, struct ptrace_syscall_info *info) { unsigned long args[ARRAY_SIZE(info->entry.args)]; int nr = info->entry.nr; int i; /* * Check that the syscall number specified in info->entry.nr * is either a value of type "int" or a sign-extended value * of type "int". */ if (nr != info->entry.nr) return -ERANGE; for (i = 0; i < ARRAY_SIZE(args); i++) { args[i] = info->entry.args[i]; /* * Check that the syscall argument specified in * info->entry.args[i] is either a value of type * "unsigned long" or a sign-extended value of type "long". */ if (args[i] != info->entry.args[i]) return -ERANGE; } syscall_set_nr(child, regs, nr); /* * If the syscall number is set to -1, setting syscall arguments is not * just pointless, it would also clobber the syscall return value on * those architectures that share the same register both for the first * argument of syscall and its return value. */ if (nr != -1) syscall_set_arguments(child, regs, args); return 0; } static int ptrace_set_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs, struct ptrace_syscall_info *info) { /* * info->entry is currently a subset of info->seccomp, * info->seccomp.ret_data is currently ignored. */ return ptrace_set_syscall_info_entry(child, regs, info); } static int ptrace_set_syscall_info_exit(struct task_struct *child, struct pt_regs *regs, struct ptrace_syscall_info *info) { long rval = info->exit.rval; /* * Check that the return value specified in info->exit.rval * is either a value of type "long" or a sign-extended value * of type "long". */ if (rval != info->exit.rval) return -ERANGE; if (info->exit.is_error) syscall_set_return_value(child, regs, rval, 0); else syscall_set_return_value(child, regs, 0, rval); return 0; } static int ptrace_set_syscall_info(struct task_struct *child, unsigned long user_size, const void __user *datavp) { struct pt_regs *regs = task_pt_regs(child); struct ptrace_syscall_info info; if (user_size < sizeof(info)) return -EINVAL; /* * The compatibility is tracked by info.op and info.flags: if user-space * does not instruct us to use unknown extra bits from future versions * of ptrace_syscall_info, we are not going to read them either. */ if (copy_from_user(&info, datavp, sizeof(info))) return -EFAULT; /* Reserved for future use. */ if (info.flags || info.reserved) return -EINVAL; /* Changing the type of the system call stop is not supported yet. */ if (ptrace_get_syscall_info_op(child) != info.op) return -EINVAL; switch (info.op) { case PTRACE_SYSCALL_INFO_ENTRY: return ptrace_set_syscall_info_entry(child, regs, &info); case PTRACE_SYSCALL_INFO_EXIT: return ptrace_set_syscall_info_exit(child, regs, &info); case PTRACE_SYSCALL_INFO_SECCOMP: return ptrace_set_syscall_info_seccomp(child, regs, &info); default: /* Other types of system call stops are not supported yet. */ return -EINVAL; } } #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */ int ptrace_request(struct task_struct *child, long request, unsigned long addr, unsigned long data) { bool seized = child->ptrace & PT_SEIZED; int ret = -EIO; kernel_siginfo_t siginfo, *si; void __user *datavp = (void __user *) data; unsigned long __user *datalp = datavp; unsigned long flags; switch (request) { case PTRACE_PEEKTEXT: case PTRACE_PEEKDATA: return generic_ptrace_peekdata(child, addr, data); case PTRACE_POKETEXT: case PTRACE_POKEDATA: return generic_ptrace_pokedata(child, addr, data); #ifdef PTRACE_OLDSETOPTIONS case PTRACE_OLDSETOPTIONS: #endif case PTRACE_SETOPTIONS: ret = ptrace_setoptions(child, data); break; case PTRACE_GETEVENTMSG: ret = put_user(child->ptrace_message, datalp); break; case PTRACE_PEEKSIGINFO: ret = ptrace_peek_siginfo(child, addr, data); break; case PTRACE_GETSIGINFO: ret = ptrace_getsiginfo(child, &siginfo); if (!ret) ret = copy_siginfo_to_user(datavp, &siginfo); break; case PTRACE_SETSIGINFO: ret = copy_siginfo_from_user(&siginfo, datavp); if (!ret) ret = ptrace_setsiginfo(child, &siginfo); break; case PTRACE_GETSIGMASK: { sigset_t *mask; if (addr != sizeof(sigset_t)) { ret = -EINVAL; break; } if (test_tsk_restore_sigmask(child)) mask = &child->saved_sigmask; else mask = &child->blocked; if (copy_to_user(datavp, mask, sizeof(sigset_t))) ret = -EFAULT; else ret = 0; break; } case PTRACE_SETSIGMASK: { sigset_t new_set; if (addr != sizeof(sigset_t)) { ret = -EINVAL; break; } if (copy_from_user(&new_set, datavp, sizeof(sigset_t))) { ret = -EFAULT; break; } sigdelsetmask(&new_set, sigmask(SIGKILL)|sigmask(SIGSTOP)); /* * Every thread does recalc_sigpending() after resume, so * retarget_shared_pending() and recalc_sigpending() are not * called here. */ spin_lock_irq(&child->sighand->siglock); child->blocked = new_set; spin_unlock_irq(&child->sighand->siglock); clear_tsk_restore_sigmask(child); ret = 0; break; } case PTRACE_INTERRUPT: /* * Stop tracee without any side-effect on signal or job * control. At least one trap is guaranteed to happen * after this request. If @child is already trapped, the * current trap is not disturbed and another trap will * happen after the current trap is ended with PTRACE_CONT. * * The actual trap might not be PTRACE_EVENT_STOP trap but * the pending condition is cleared regardless. */ if (unlikely(!seized || !lock_task_sighand(child, &flags))) break; /* * INTERRUPT doesn't disturb existing trap sans one * exception. If ptracer issued LISTEN for the current * STOP, this INTERRUPT should clear LISTEN and re-trap * tracee into STOP. */ if (likely(task_set_jobctl_pending(child, JOBCTL_TRAP_STOP))) ptrace_signal_wake_up(child, child->jobctl & JOBCTL_LISTENING); unlock_task_sighand(child, &flags); ret = 0; break; case PTRACE_LISTEN: /* * Listen for events. Tracee must be in STOP. It's not * resumed per-se but is not considered to be in TRACED by * wait(2) or ptrace(2). If an async event (e.g. group * stop state change) happens, tracee will enter STOP trap * again. Alternatively, ptracer can issue INTERRUPT to * finish listening and re-trap tracee into STOP. */ if (unlikely(!seized || !lock_task_sighand(child, &flags))) break; si = child->last_siginfo; if (likely(si && (si->si_code >> 8) == PTRACE_EVENT_STOP)) { child->jobctl |= JOBCTL_LISTENING; /* * If NOTIFY is set, it means event happened between * start of this trap and now. Trigger re-trap. */ if (child->jobctl & JOBCTL_TRAP_NOTIFY) ptrace_signal_wake_up(child, true); ret = 0; } unlock_task_sighand(child, &flags); break; case PTRACE_DETACH: /* detach a process that was attached. */ ret = ptrace_detach(child, data); break; #ifdef CONFIG_BINFMT_ELF_FDPIC case PTRACE_GETFDPIC: { struct mm_struct *mm = get_task_mm(child); unsigned long tmp = 0; ret = -ESRCH; if (!mm) break; switch (addr) { case PTRACE_GETFDPIC_EXEC: tmp = mm->context.exec_fdpic_loadmap; break; case PTRACE_GETFDPIC_INTERP: tmp = mm->context.interp_fdpic_loadmap; break; default: break; } mmput(mm); ret = put_user(tmp, datalp); break; } #endif case PTRACE_SINGLESTEP: #ifdef PTRACE_SINGLEBLOCK case PTRACE_SINGLEBLOCK: #endif #ifdef PTRACE_SYSEMU case PTRACE_SYSEMU: case PTRACE_SYSEMU_SINGLESTEP: #endif case PTRACE_SYSCALL: case PTRACE_CONT: return ptrace_resume(child, request, data); case PTRACE_KILL: send_sig_info(SIGKILL, SEND_SIG_NOINFO, child); return 0; #ifdef CONFIG_HAVE_ARCH_TRACEHOOK case PTRACE_GETREGSET: case PTRACE_SETREGSET: { struct iovec kiov; struct iovec __user *uiov = datavp; if (!access_ok(uiov, sizeof(*uiov))) return -EFAULT; if (__get_user(kiov.iov_base, &uiov->iov_base) || __get_user(kiov.iov_len, &uiov->iov_len)) return -EFAULT; ret = ptrace_regset(child, request, addr, &kiov); if (!ret) ret = __put_user(kiov.iov_len, &uiov->iov_len); break; } case PTRACE_GET_SYSCALL_INFO: ret = ptrace_get_syscall_info(child, addr, datavp); break; case PTRACE_SET_SYSCALL_INFO: ret = ptrace_set_syscall_info(child, addr, datavp); break; #endif case PTRACE_SECCOMP_GET_FILTER: ret = seccomp_get_filter(child, addr, datavp); break; case PTRACE_SECCOMP_GET_METADATA: ret = seccomp_get_metadata(child, addr, datavp); break; #ifdef CONFIG_RSEQ case PTRACE_GET_RSEQ_CONFIGURATION: ret = ptrace_get_rseq_configuration(child, addr, datavp); break; #endif case PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG: ret = syscall_user_dispatch_set_config(child, addr, datavp); break; case PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG: ret = syscall_user_dispatch_get_config(child, addr, datavp); break; default: break; } return ret; } SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr, unsigned long, data) { struct task_struct *child; long ret; if (request == PTRACE_TRACEME) { ret = ptrace_traceme(); goto out; } child = find_get_task_by_vpid(pid); if (!child) { ret = -ESRCH; goto out; } if (request == PTRACE_ATTACH || request == PTRACE_SEIZE) { ret = ptrace_attach(child, request, addr, data); goto out_put_task_struct; } ret = ptrace_check_attach(child, request == PTRACE_KILL || request == PTRACE_INTERRUPT); if (ret < 0) goto out_put_task_struct; ret = arch_ptrace(child, request, addr, data); if (ret || request != PTRACE_DETACH) ptrace_unfreeze_traced(child); out_put_task_struct: put_task_struct(child); out: return ret; } int generic_ptrace_peekdata(struct task_struct *tsk, unsigned long addr, unsigned long data) { unsigned long tmp; int copied; copied = ptrace_access_vm(tsk, addr, &tmp, sizeof(tmp), FOLL_FORCE); if (copied != sizeof(tmp)) return -EIO; return put_user(tmp, (unsigned long __user *)data); } int generic_ptrace_pokedata(struct task_struct *tsk, unsigned long addr, unsigned long data) { int copied; copied = ptrace_access_vm(tsk, addr, &data, sizeof(data), FOLL_FORCE | FOLL_WRITE); return (copied == sizeof(data)) ? 0 : -EIO; } #if defined CONFIG_COMPAT int compat_ptrace_request(struct task_struct *child, compat_long_t request, compat_ulong_t addr, compat_ulong_t data) { compat_ulong_t __user *datap = compat_ptr(data); compat_ulong_t word; kernel_siginfo_t siginfo; int ret; switch (request) { case PTRACE_PEEKTEXT: case PTRACE_PEEKDATA: ret = ptrace_access_vm(child, addr, &word, sizeof(word), FOLL_FORCE); if (ret != sizeof(word)) ret = -EIO; else ret = put_user(word, datap); break; case PTRACE_POKETEXT: case PTRACE_POKEDATA: ret = ptrace_access_vm(child, addr, &data, sizeof(data), FOLL_FORCE | FOLL_WRITE); ret = (ret != sizeof(data) ? -EIO : 0); break; case PTRACE_GETEVENTMSG: ret = put_user((compat_ulong_t) child->ptrace_message, datap); break; case PTRACE_GETSIGINFO: ret = ptrace_getsiginfo(child, &siginfo); if (!ret) ret = copy_siginfo_to_user32( (struct compat_siginfo __user *) datap, &siginfo); break; case PTRACE_SETSIGINFO: ret = copy_siginfo_from_user32( &siginfo, (struct compat_siginfo __user *) datap); if (!ret) ret = ptrace_setsiginfo(child, &siginfo); break; #ifdef CONFIG_HAVE_ARCH_TRACEHOOK case PTRACE_GETREGSET: case PTRACE_SETREGSET: { struct iovec kiov; struct compat_iovec __user *uiov = (struct compat_iovec __user *) datap; compat_uptr_t ptr; compat_size_t len; if (!access_ok(uiov, sizeof(*uiov))) return -EFAULT; if (__get_user(ptr, &uiov->iov_base) || __get_user(len, &uiov->iov_len)) return -EFAULT; kiov.iov_base = compat_ptr(ptr); kiov.iov_len = len; ret = ptrace_regset(child, request, addr, &kiov); if (!ret) ret = __put_user(kiov.iov_len, &uiov->iov_len); break; } #endif default: ret = ptrace_request(child, request, addr, data); } return ret; } COMPAT_SYSCALL_DEFINE4(ptrace, compat_long_t, request, compat_long_t, pid, compat_long_t, addr, compat_long_t, data) { struct task_struct *child; long ret; if (request == PTRACE_TRACEME) { ret = ptrace_traceme(); goto out; } child = find_get_task_by_vpid(pid); if (!child) { ret = -ESRCH; goto out; } if (request == PTRACE_ATTACH || request == PTRACE_SEIZE) { ret = ptrace_attach(child, request, addr, data); goto out_put_task_struct; } ret = ptrace_check_attach(child, request == PTRACE_KILL || request == PTRACE_INTERRUPT); if (!ret) { ret = compat_arch_ptrace(child, request, addr, data); if (ret || request != PTRACE_DETACH) ptrace_unfreeze_traced(child); } out_put_task_struct: put_task_struct(child); out: return ret; } #endif /* CONFIG_COMPAT */ |
| 42 41 41 41 41 26 24 26 7 5 3 2 1 3 3 1 8 2 1 1 27 27 22 27 25 2 2 27 26 26 3 27 27 27 1 26 27 27 10 5 6 4 3 3 3 1 2 4 4 1 3 2 1 1 1 2 5 4 16 5 1 2 6 5 4 3 7 3 2 2 2 1 1 4 3 2 1 1 1 33 34 4 2 4 1 2 5 7 7 3 3 3 16 34 40 38 37 31 3 1 2 1 1 40 16 15 6 5 4 2 8 7 1 6 5 4 3 4 5 15 15 29 22 22 22 25 25 23 25 15 27 2 2 2 1 1 2 2 2 2 2 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 | /* * Copyright (c) 2006, 2019 Oracle and/or its affiliates. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * */ #include <linux/module.h> #include <linux/errno.h> #include <linux/kernel.h> #include <linux/gfp.h> #include <linux/in.h> #include <linux/ipv6.h> #include <linux/poll.h> #include <net/sock.h> #include "rds.h" /* this is just used for stats gathering :/ */ static DEFINE_SPINLOCK(rds_sock_lock); static unsigned long rds_sock_count; static LIST_HEAD(rds_sock_list); DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq); /* * This is called as the final descriptor referencing this socket is closed. * We have to unbind the socket so that another socket can be bound to the * address it was using. * * We have to be careful about racing with the incoming path. sock_orphan() * sets SOCK_DEAD and we use that as an indicator to the rx path that new * messages shouldn't be queued. */ static int rds_release(struct socket *sock) { struct sock *sk = sock->sk; struct rds_sock *rs; if (!sk) goto out; rs = rds_sk_to_rs(sk); sock_orphan(sk); /* Note - rds_clear_recv_queue grabs rs_recv_lock, so * that ensures the recv path has completed messing * with the socket. */ rds_clear_recv_queue(rs); rds_cong_remove_socket(rs); rds_remove_bound(rs); rds_send_drop_to(rs, NULL); rds_rdma_drop_keys(rs); rds_notify_queue_get(rs, NULL); rds_notify_msg_zcopy_purge(&rs->rs_zcookie_queue); spin_lock_bh(&rds_sock_lock); list_del_init(&rs->rs_item); rds_sock_count--; spin_unlock_bh(&rds_sock_lock); rds_trans_put(rs->rs_transport); sock->sk = NULL; sock_put(sk); out: return 0; } /* * Careful not to race with rds_release -> sock_orphan which clears sk_sleep. * _bh() isn't OK here, we're called from interrupt handlers. It's probably OK * to wake the waitqueue after sk_sleep is clear as we hold a sock ref, but * this seems more conservative. * NB - normally, one would use sk_callback_lock for this, but we can * get here from interrupts, whereas the network code grabs sk_callback_lock * with _lock_bh only - so relying on sk_callback_lock introduces livelocks. */ void rds_wake_sk_sleep(struct rds_sock *rs) { unsigned long flags; read_lock_irqsave(&rs->rs_recv_lock, flags); __rds_wake_sk_sleep(rds_rs_to_sk(rs)); read_unlock_irqrestore(&rs->rs_recv_lock, flags); } static int rds_getname(struct socket *sock, struct sockaddr *uaddr, int peer) { struct rds_sock *rs = rds_sk_to_rs(sock->sk); struct sockaddr_in6 *sin6; struct sockaddr_in *sin; int uaddr_len; /* racey, don't care */ if (peer) { if (ipv6_addr_any(&rs->rs_conn_addr)) return -ENOTCONN; if (ipv6_addr_v4mapped(&rs->rs_conn_addr)) { sin = (struct sockaddr_in *)uaddr; memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); sin->sin_family = AF_INET; sin->sin_port = rs->rs_conn_port; sin->sin_addr.s_addr = rs->rs_conn_addr_v4; uaddr_len = sizeof(*sin); } else { sin6 = (struct sockaddr_in6 *)uaddr; sin6->sin6_family = AF_INET6; sin6->sin6_port = rs->rs_conn_port; sin6->sin6_addr = rs->rs_conn_addr; sin6->sin6_flowinfo = 0; /* scope_id is the same as in the bound address. */ sin6->sin6_scope_id = rs->rs_bound_scope_id; uaddr_len = sizeof(*sin6); } } else { /* If socket is not yet bound and the socket is connected, * set the return address family to be the same as the * connected address, but with 0 address value. If it is not * connected, set the family to be AF_UNSPEC (value 0) and * the address size to be that of an IPv4 address. */ if (ipv6_addr_any(&rs->rs_bound_addr)) { if (ipv6_addr_any(&rs->rs_conn_addr)) { sin = (struct sockaddr_in *)uaddr; memset(sin, 0, sizeof(*sin)); sin->sin_family = AF_UNSPEC; return sizeof(*sin); } #if IS_ENABLED(CONFIG_IPV6) if (!(ipv6_addr_type(&rs->rs_conn_addr) & IPV6_ADDR_MAPPED)) { sin6 = (struct sockaddr_in6 *)uaddr; memset(sin6, 0, sizeof(*sin6)); sin6->sin6_family = AF_INET6; return sizeof(*sin6); } #endif sin = (struct sockaddr_in *)uaddr; memset(sin, 0, sizeof(*sin)); sin->sin_family = AF_INET; return sizeof(*sin); } if (ipv6_addr_v4mapped(&rs->rs_bound_addr)) { sin = (struct sockaddr_in *)uaddr; memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); sin->sin_family = AF_INET; sin->sin_port = rs->rs_bound_port; sin->sin_addr.s_addr = rs->rs_bound_addr_v4; uaddr_len = sizeof(*sin); } else { sin6 = (struct sockaddr_in6 *)uaddr; sin6->sin6_family = AF_INET6; sin6->sin6_port = rs->rs_bound_port; sin6->sin6_addr = rs->rs_bound_addr; sin6->sin6_flowinfo = 0; sin6->sin6_scope_id = rs->rs_bound_scope_id; uaddr_len = sizeof(*sin6); } } return uaddr_len; } /* * RDS' poll is without a doubt the least intuitive part of the interface, * as EPOLLIN and EPOLLOUT do not behave entirely as you would expect from * a network protocol. * * EPOLLIN is asserted if * - there is data on the receive queue. * - to signal that a previously congested destination may have become * uncongested * - A notification has been queued to the socket (this can be a congestion * update, or a RDMA completion, or a MSG_ZEROCOPY completion). * * EPOLLOUT is asserted if there is room on the send queue. This does not mean * however, that the next sendmsg() call will succeed. If the application tries * to send to a congested destination, the system call may still fail (and * return ENOBUFS). */ static __poll_t rds_poll(struct file *file, struct socket *sock, poll_table *wait) { struct sock *sk = sock->sk; struct rds_sock *rs = rds_sk_to_rs(sk); __poll_t mask = 0; unsigned long flags; poll_wait(file, sk_sleep(sk), wait); if (rs->rs_seen_congestion) poll_wait(file, &rds_poll_waitq, wait); read_lock_irqsave(&rs->rs_recv_lock, flags); if (!rs->rs_cong_monitor) { /* When a congestion map was updated, we signal EPOLLIN for * "historical" reasons. Applications can also poll for * WRBAND instead. */ if (rds_cong_updated_since(&rs->rs_cong_track)) mask |= (EPOLLIN | EPOLLRDNORM | EPOLLWRBAND); } else { spin_lock(&rs->rs_lock); if (rs->rs_cong_notify) mask |= (EPOLLIN | EPOLLRDNORM); spin_unlock(&rs->rs_lock); } if (!list_empty(&rs->rs_recv_queue) || !list_empty(&rs->rs_notify_queue) || !list_empty(&rs->rs_zcookie_queue.zcookie_head)) mask |= (EPOLLIN | EPOLLRDNORM); if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) mask |= (EPOLLOUT | EPOLLWRNORM); if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue)) mask |= EPOLLERR; read_unlock_irqrestore(&rs->rs_recv_lock, flags); /* clear state any time we wake a seen-congested socket */ if (mask) rs->rs_seen_congestion = 0; return mask; } static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { struct rds_sock *rs = rds_sk_to_rs(sock->sk); rds_tos_t utos, tos = 0; switch (cmd) { case SIOCRDSSETTOS: if (get_user(utos, (rds_tos_t __user *)arg)) return -EFAULT; if (rs->rs_transport && rs->rs_transport->get_tos_map) tos = rs->rs_transport->get_tos_map(utos); else return -ENOIOCTLCMD; spin_lock_bh(&rds_sock_lock); if (rs->rs_tos || rs->rs_conn) { spin_unlock_bh(&rds_sock_lock); return -EINVAL; } rs->rs_tos = tos; spin_unlock_bh(&rds_sock_lock); break; case SIOCRDSGETTOS: spin_lock_bh(&rds_sock_lock); tos = rs->rs_tos; spin_unlock_bh(&rds_sock_lock); if (put_user(tos, (rds_tos_t __user *)arg)) return -EFAULT; break; default: return -ENOIOCTLCMD; } return 0; } static int rds_cancel_sent_to(struct rds_sock *rs, sockptr_t optval, int len) { struct sockaddr_in6 sin6; struct sockaddr_in sin; int ret = 0; /* racing with another thread binding seems ok here */ if (ipv6_addr_any(&rs->rs_bound_addr)) { ret = -ENOTCONN; /* XXX not a great errno */ goto out; } if (len < sizeof(struct sockaddr_in)) { ret = -EINVAL; goto out; } else if (len < sizeof(struct sockaddr_in6)) { /* Assume IPv4 */ if (copy_from_sockptr(&sin, optval, sizeof(struct sockaddr_in))) { ret = -EFAULT; goto out; } ipv6_addr_set_v4mapped(sin.sin_addr.s_addr, &sin6.sin6_addr); sin6.sin6_port = sin.sin_port; } else { if (copy_from_sockptr(&sin6, optval, sizeof(struct sockaddr_in6))) { ret = -EFAULT; goto out; } } rds_send_drop_to(rs, &sin6); out: return ret; } static int rds_set_bool_option(unsigned char *optvar, sockptr_t optval, int optlen) { int value; if (optlen < sizeof(int)) return -EINVAL; if (copy_from_sockptr(&value, optval, sizeof(int))) return -EFAULT; *optvar = !!value; return 0; } static int rds_cong_monitor(struct rds_sock *rs, sockptr_t optval, int optlen) { int ret; ret = rds_set_bool_option(&rs->rs_cong_monitor, optval, optlen); if (ret == 0) { if (rs->rs_cong_monitor) { rds_cong_add_socket(rs); } else { rds_cong_remove_socket(rs); rs->rs_cong_mask = 0; rs->rs_cong_notify = 0; } } return ret; } static int rds_set_transport(struct rds_sock *rs, sockptr_t optval, int optlen) { int t_type; if (rs->rs_transport) return -EOPNOTSUPP; /* previously attached to transport */ if (optlen != sizeof(int)) return -EINVAL; if (copy_from_sockptr(&t_type, optval, sizeof(t_type))) return -EFAULT; if (t_type < 0 || t_type >= RDS_TRANS_COUNT) return -EINVAL; rs->rs_transport = rds_trans_get(t_type); return rs->rs_transport ? 0 : -ENOPROTOOPT; } static int rds_enable_recvtstamp(struct sock *sk, sockptr_t optval, int optlen, int optname) { int val, valbool; if (optlen != sizeof(int)) return -EFAULT; if (copy_from_sockptr(&val, optval, sizeof(int))) return -EFAULT; valbool = val ? 1 : 0; if (optname == SO_TIMESTAMP_NEW) sock_set_flag(sk, SOCK_TSTAMP_NEW); if (valbool) sock_set_flag(sk, SOCK_RCVTSTAMP); else sock_reset_flag(sk, SOCK_RCVTSTAMP); return 0; } static int rds_recv_track_latency(struct rds_sock *rs, sockptr_t optval, int optlen) { struct rds_rx_trace_so trace; int i; if (optlen != sizeof(struct rds_rx_trace_so)) return -EFAULT; if (copy_from_sockptr(&trace, optval, sizeof(trace))) return -EFAULT; if (trace.rx_traces > RDS_MSG_RX_DGRAM_TRACE_MAX) return -EFAULT; rs->rs_rx_traces = trace.rx_traces; for (i = 0; i < rs->rs_rx_traces; i++) { if (trace.rx_trace_pos[i] >= RDS_MSG_RX_DGRAM_TRACE_MAX) { rs->rs_rx_traces = 0; return -EFAULT; } rs->rs_rx_trace[i] = trace.rx_trace_pos[i]; } return 0; } static int rds_setsockopt(struct socket *sock, int level, int optname, sockptr_t optval, unsigned int optlen) { struct rds_sock *rs = rds_sk_to_rs(sock->sk); int ret; if (level != SOL_RDS) { ret = -ENOPROTOOPT; goto out; } switch (optname) { case RDS_CANCEL_SENT_TO: ret = rds_cancel_sent_to(rs, optval, optlen); break; case RDS_GET_MR: ret = rds_get_mr(rs, optval, optlen); break; case RDS_GET_MR_FOR_DEST: ret = rds_get_mr_for_dest(rs, optval, optlen); break; case RDS_FREE_MR: ret = rds_free_mr(rs, optval, optlen); break; case RDS_RECVERR: ret = rds_set_bool_option(&rs->rs_recverr, optval, optlen); break; case RDS_CONG_MONITOR: ret = rds_cong_monitor(rs, optval, optlen); break; case SO_RDS_TRANSPORT: lock_sock(sock->sk); ret = rds_set_transport(rs, optval, optlen); release_sock(sock->sk); break; case SO_TIMESTAMP_OLD: case SO_TIMESTAMP_NEW: lock_sock(sock->sk); ret = rds_enable_recvtstamp(sock->sk, optval, optlen, optname); release_sock(sock->sk); break; case SO_RDS_MSG_RXPATH_LATENCY: ret = rds_recv_track_latency(rs, optval, optlen); break; default: ret = -ENOPROTOOPT; } out: return ret; } static int rds_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen) { struct rds_sock *rs = rds_sk_to_rs(sock->sk); int ret = -ENOPROTOOPT, len; int trans; if (level != SOL_RDS) goto out; if (get_user(len, optlen)) { ret = -EFAULT; goto out; } switch (optname) { case RDS_INFO_FIRST ... RDS_INFO_LAST: ret = rds_info_getsockopt(sock, optname, optval, optlen); break; case RDS_RECVERR: if (len < sizeof(int)) ret = -EINVAL; else if (put_user(rs->rs_recverr, (int __user *) optval) || put_user(sizeof(int), optlen)) ret = -EFAULT; else ret = 0; break; case SO_RDS_TRANSPORT: if (len < sizeof(int)) { ret = -EINVAL; break; } trans = (rs->rs_transport ? rs->rs_transport->t_type : RDS_TRANS_NONE); /* unbound */ if (put_user(trans, (int __user *)optval) || put_user(sizeof(int), optlen)) ret = -EFAULT; else ret = 0; break; default: break; } out: return ret; } static int rds_connect(struct socket *sock, struct sockaddr *uaddr, int addr_len, int flags) { struct sock *sk = sock->sk; struct sockaddr_in *sin; struct rds_sock *rs = rds_sk_to_rs(sk); int ret = 0; if (addr_len < offsetofend(struct sockaddr, sa_family)) return -EINVAL; lock_sock(sk); switch (uaddr->sa_family) { case AF_INET: sin = (struct sockaddr_in *)uaddr; if (addr_len < sizeof(struct sockaddr_in)) { ret = -EINVAL; break; } if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) { ret = -EDESTADDRREQ; break; } if (ipv4_is_multicast(sin->sin_addr.s_addr) || sin->sin_addr.s_addr == htonl(INADDR_BROADCAST)) { ret = -EINVAL; break; } ipv6_addr_set_v4mapped(sin->sin_addr.s_addr, &rs->rs_conn_addr); rs->rs_conn_port = sin->sin_port; break; #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: { struct sockaddr_in6 *sin6; int addr_type; sin6 = (struct sockaddr_in6 *)uaddr; if (addr_len < sizeof(struct sockaddr_in6)) { ret = -EINVAL; break; } addr_type = ipv6_addr_type(&sin6->sin6_addr); if (!(addr_type & IPV6_ADDR_UNICAST)) { __be32 addr4; if (!(addr_type & IPV6_ADDR_MAPPED)) { ret = -EPROTOTYPE; break; } /* It is a mapped address. Need to do some sanity * checks. */ addr4 = sin6->sin6_addr.s6_addr32[3]; if (addr4 == htonl(INADDR_ANY) || addr4 == htonl(INADDR_BROADCAST) || ipv4_is_multicast(addr4)) { ret = -EPROTOTYPE; break; } } if (addr_type & IPV6_ADDR_LINKLOCAL) { /* If socket is already bound to a link local address, * the peer address must be on the same link. */ if (sin6->sin6_scope_id == 0 || (!ipv6_addr_any(&rs->rs_bound_addr) && rs->rs_bound_scope_id && sin6->sin6_scope_id != rs->rs_bound_scope_id)) { ret = -EINVAL; break; } /* Remember the connected address scope ID. It will * be checked against the binding local address when * the socket is bound. */ rs->rs_bound_scope_id = sin6->sin6_scope_id; } rs->rs_conn_addr = sin6->sin6_addr; rs->rs_conn_port = sin6->sin6_port; break; } #endif default: ret = -EAFNOSUPPORT; break; } release_sock(sk); return ret; } static struct proto rds_proto = { .name = "RDS", .owner = THIS_MODULE, .obj_size = sizeof(struct rds_sock), }; static const struct proto_ops rds_proto_ops = { .family = AF_RDS, .owner = THIS_MODULE, .release = rds_release, .bind = rds_bind, .connect = rds_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = rds_getname, .poll = rds_poll, .ioctl = rds_ioctl, .listen = sock_no_listen, .shutdown = sock_no_shutdown, .setsockopt = rds_setsockopt, .getsockopt = rds_getsockopt, .sendmsg = rds_sendmsg, .recvmsg = rds_recvmsg, .mmap = sock_no_mmap, }; static void rds_sock_destruct(struct sock *sk) { struct rds_sock *rs = rds_sk_to_rs(sk); WARN_ON((&rs->rs_item != rs->rs_item.next || &rs->rs_item != rs->rs_item.prev)); } static int __rds_create(struct socket *sock, struct sock *sk, int protocol) { struct rds_sock *rs; sock_init_data(sock, sk); sock->ops = &rds_proto_ops; sk->sk_protocol = protocol; sk->sk_destruct = rds_sock_destruct; rs = rds_sk_to_rs(sk); spin_lock_init(&rs->rs_lock); rwlock_init(&rs->rs_recv_lock); INIT_LIST_HEAD(&rs->rs_send_queue); INIT_LIST_HEAD(&rs->rs_recv_queue); INIT_LIST_HEAD(&rs->rs_notify_queue); INIT_LIST_HEAD(&rs->rs_cong_list); rds_message_zcopy_queue_init(&rs->rs_zcookie_queue); spin_lock_init(&rs->rs_rdma_lock); rs->rs_rdma_keys = RB_ROOT; rs->rs_rx_traces = 0; rs->rs_tos = 0; rs->rs_conn = NULL; spin_lock_bh(&rds_sock_lock); list_add_tail(&rs->rs_item, &rds_sock_list); rds_sock_count++; spin_unlock_bh(&rds_sock_lock); return 0; } static int rds_create(struct net *net, struct socket *sock, int protocol, int kern) { struct sock *sk; if (sock->type != SOCK_SEQPACKET || protocol) return -ESOCKTNOSUPPORT; sk = sk_alloc(net, AF_RDS, GFP_KERNEL, &rds_proto, kern); if (!sk) return -ENOMEM; return __rds_create(sock, sk, protocol); } void rds_sock_addref(struct rds_sock *rs) { sock_hold(rds_rs_to_sk(rs)); } void rds_sock_put(struct rds_sock *rs) { sock_put(rds_rs_to_sk(rs)); } static const struct net_proto_family rds_family_ops = { .family = AF_RDS, .create = rds_create, .owner = THIS_MODULE, }; static void rds_sock_inc_info(struct socket *sock, unsigned int len, struct rds_info_iterator *iter, struct rds_info_lengths *lens) { struct rds_sock *rs; struct rds_incoming *inc; unsigned int total = 0; len /= sizeof(struct rds_info_message); spin_lock_bh(&rds_sock_lock); list_for_each_entry(rs, &rds_sock_list, rs_item) { /* This option only supports IPv4 sockets. */ if (!ipv6_addr_v4mapped(&rs->rs_bound_addr)) continue; read_lock(&rs->rs_recv_lock); /* XXX too lazy to maintain counts.. */ list_for_each_entry(inc, &rs->rs_recv_queue, i_item) { total++; if (total <= len) rds_inc_info_copy(inc, iter, inc->i_saddr.s6_addr32[3], rs->rs_bound_addr_v4, 1); } read_unlock(&rs->rs_recv_lock); } spin_unlock_bh(&rds_sock_lock); lens->nr = total; lens->each = sizeof(struct rds_info_message); } #if IS_ENABLED(CONFIG_IPV6) static void rds6_sock_inc_info(struct socket *sock, unsigned int len, struct rds_info_iterator *iter, struct rds_info_lengths *lens) { struct rds_incoming *inc; unsigned int total = 0; struct rds_sock *rs; len /= sizeof(struct rds6_info_message); spin_lock_bh(&rds_sock_lock); list_for_each_entry(rs, &rds_sock_list, rs_item) { read_lock(&rs->rs_recv_lock); list_for_each_entry(inc, &rs->rs_recv_queue, i_item) { total++; if (total <= len) rds6_inc_info_copy(inc, iter, &inc->i_saddr, &rs->rs_bound_addr, 1); } read_unlock(&rs->rs_recv_lock); } spin_unlock_bh(&rds_sock_lock); lens->nr = total; lens->each = sizeof(struct rds6_info_message); } #endif static void rds_sock_info(struct socket *sock, unsigned int len, struct rds_info_iterator *iter, struct rds_info_lengths *lens) { struct rds_info_socket sinfo; unsigned int cnt = 0; struct rds_sock *rs; len /= sizeof(struct rds_info_socket); spin_lock_bh(&rds_sock_lock); if (len < rds_sock_count) { cnt = rds_sock_count; goto out; } list_for_each_entry(rs, &rds_sock_list, rs_item) { /* This option only supports IPv4 sockets. */ if (!ipv6_addr_v4mapped(&rs->rs_bound_addr)) continue; sinfo.sndbuf = rds_sk_sndbuf(rs); sinfo.rcvbuf = rds_sk_rcvbuf(rs); sinfo.bound_addr = rs->rs_bound_addr_v4; sinfo.connected_addr = rs->rs_conn_addr_v4; sinfo.bound_port = rs->rs_bound_port; sinfo.connected_port = rs->rs_conn_port; sinfo.inum = sock_i_ino(rds_rs_to_sk(rs)); rds_info_copy(iter, &sinfo, sizeof(sinfo)); cnt++; } out: lens->nr = cnt; lens->each = sizeof(struct rds_info_socket); spin_unlock_bh(&rds_sock_lock); } #if IS_ENABLED(CONFIG_IPV6) static void rds6_sock_info(struct socket *sock, unsigned int len, struct rds_info_iterator *iter, struct rds_info_lengths *lens) { struct rds6_info_socket sinfo6; struct rds_sock *rs; len /= sizeof(struct rds6_info_socket); spin_lock_bh(&rds_sock_lock); if (len < rds_sock_count) goto out; list_for_each_entry(rs, &rds_sock_list, rs_item) { sinfo6.sndbuf = rds_sk_sndbuf(rs); sinfo6.rcvbuf = rds_sk_rcvbuf(rs); sinfo6.bound_addr = rs->rs_bound_addr; sinfo6.connected_addr = rs->rs_conn_addr; sinfo6.bound_port = rs->rs_bound_port; sinfo6.connected_port = rs->rs_conn_port; sinfo6.inum = sock_i_ino(rds_rs_to_sk(rs)); rds_info_copy(iter, &sinfo6, sizeof(sinfo6)); } out: lens->nr = rds_sock_count; lens->each = sizeof(struct rds6_info_socket); spin_unlock_bh(&rds_sock_lock); } #endif static void rds_exit(void) { sock_unregister(rds_family_ops.family); proto_unregister(&rds_proto); rds_conn_exit(); rds_cong_exit(); rds_sysctl_exit(); rds_threads_exit(); rds_stats_exit(); rds_page_exit(); rds_bind_lock_destroy(); rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info); rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); #if IS_ENABLED(CONFIG_IPV6) rds_info_deregister_func(RDS6_INFO_SOCKETS, rds6_sock_info); rds_info_deregister_func(RDS6_INFO_RECV_MESSAGES, rds6_sock_inc_info); #endif } module_exit(rds_exit); u32 rds_gen_num; static int __init rds_init(void) { int ret; net_get_random_once(&rds_gen_num, sizeof(rds_gen_num)); ret = rds_bind_lock_init(); if (ret) goto out; ret = rds_conn_init(); if (ret) goto out_bind; ret = rds_threads_init(); if (ret) goto out_conn; ret = rds_sysctl_init(); if (ret) goto out_threads; ret = rds_stats_init(); if (ret) goto out_sysctl; ret = proto_register(&rds_proto, 1); if (ret) goto out_stats; ret = sock_register(&rds_family_ops); if (ret) goto out_proto; rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info); rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); #if IS_ENABLED(CONFIG_IPV6) rds_info_register_func(RDS6_INFO_SOCKETS, rds6_sock_info); rds_info_register_func(RDS6_INFO_RECV_MESSAGES, rds6_sock_inc_info); #endif goto out; out_proto: proto_unregister(&rds_proto); out_stats: rds_stats_exit(); out_sysctl: rds_sysctl_exit(); out_threads: rds_threads_exit(); out_conn: rds_conn_exit(); rds_cong_exit(); rds_page_exit(); out_bind: rds_bind_lock_destroy(); out: return ret; } module_init(rds_init); #define DRV_VERSION "4.0" #define DRV_RELDATE "Feb 12, 2009" MODULE_AUTHOR("Oracle Corporation <rds-devel@oss.oracle.com>"); MODULE_DESCRIPTION("RDS: Reliable Datagram Sockets" " v" DRV_VERSION " (" DRV_RELDATE ")"); MODULE_VERSION(DRV_VERSION); MODULE_LICENSE("Dual BSD/GPL"); MODULE_ALIAS_NETPROTO(PF_RDS); |
| 87 86 8 8 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* Authentication token and access key management internal defs * * Copyright (C) 2003-5, 2007 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) */ #ifndef _INTERNAL_H #define _INTERNAL_H #include <linux/sched.h> #include <linux/wait_bit.h> #include <linux/cred.h> #include <linux/key-type.h> #include <linux/task_work.h> #include <linux/keyctl.h> #include <linux/refcount.h> #include <linux/watch_queue.h> #include <linux/compat.h> #include <linux/mm.h> #include <linux/vmalloc.h> struct iovec; #ifdef __KDEBUG #define kenter(FMT, ...) \ printk(KERN_DEBUG "==> %s("FMT")\n", __func__, ##__VA_ARGS__) #define kleave(FMT, ...) \ printk(KERN_DEBUG "<== %s()"FMT"\n", __func__, ##__VA_ARGS__) #define kdebug(FMT, ...) \ printk(KERN_DEBUG " "FMT"\n", ##__VA_ARGS__) #else #define kenter(FMT, ...) \ no_printk(KERN_DEBUG "==> %s("FMT")\n", __func__, ##__VA_ARGS__) #define kleave(FMT, ...) \ no_printk(KERN_DEBUG "<== %s()"FMT"\n", __func__, ##__VA_ARGS__) #define kdebug(FMT, ...) \ no_printk(KERN_DEBUG FMT"\n", ##__VA_ARGS__) #endif extern struct key_type key_type_dead; extern struct key_type key_type_user; extern struct key_type key_type_logon; /*****************************************************************************/ /* * Keep track of keys for a user. * * This needs to be separate to user_struct to avoid a refcount-loop * (user_struct pins some keyrings which pin this struct). * * We also keep track of keys under request from userspace for this UID here. */ struct key_user { struct rb_node node; struct mutex cons_lock; /* construction initiation lock */ spinlock_t lock; refcount_t usage; /* for accessing qnkeys & qnbytes */ atomic_t nkeys; /* number of keys */ atomic_t nikeys; /* number of instantiated keys */ kuid_t uid; int qnkeys; /* number of keys allocated to this user */ int qnbytes; /* number of bytes allocated to this user */ }; extern struct rb_root key_user_tree; extern spinlock_t key_user_lock; extern struct key_user root_key_user; extern struct key_user *key_user_lookup(kuid_t uid); extern void key_user_put(struct key_user *user); /* * Key quota limits. * - root has its own separate limits to everyone else */ extern unsigned key_quota_root_maxkeys; extern unsigned key_quota_root_maxbytes; extern unsigned key_quota_maxkeys; extern unsigned key_quota_maxbytes; #define KEYQUOTA_LINK_BYTES 4 /* a link in a keyring is worth 4 bytes */ extern struct kmem_cache *key_jar; extern struct rb_root key_serial_tree; extern spinlock_t key_serial_lock; extern struct mutex key_construction_mutex; extern wait_queue_head_t request_key_conswq; extern void key_set_index_key(struct keyring_index_key *index_key); extern struct key_type *key_type_lookup(const char *type); extern void key_type_put(struct key_type *ktype); extern int __key_link_lock(struct key *keyring, const struct keyring_index_key *index_key); extern int __key_move_lock(struct key *l_keyring, struct key *u_keyring, const struct keyring_index_key *index_key); extern int __key_link_begin(struct key *keyring, const struct keyring_index_key *index_key, struct assoc_array_edit **_edit); extern int __key_link_check_live_key(struct key *keyring, struct key *key); extern void __key_link(struct key *keyring, struct key *key, struct assoc_array_edit **_edit); extern void __key_link_end(struct key *keyring, const struct keyring_index_key *index_key, struct assoc_array_edit *edit); extern key_ref_t find_key_to_update(key_ref_t keyring_ref, const struct keyring_index_key *index_key); struct keyring_search_context { struct keyring_index_key index_key; const struct cred *cred; struct key_match_data match_data; unsigned flags; #define KEYRING_SEARCH_NO_STATE_CHECK 0x0001 /* Skip state checks */ #define KEYRING_SEARCH_DO_STATE_CHECK 0x0002 /* Override NO_STATE_CHECK */ #define KEYRING_SEARCH_NO_UPDATE_TIME 0x0004 /* Don't update times */ #define KEYRING_SEARCH_NO_CHECK_PERM 0x0008 /* Don't check permissions */ #define KEYRING_SEARCH_DETECT_TOO_DEEP 0x0010 /* Give an error on excessive depth */ #define KEYRING_SEARCH_SKIP_EXPIRED 0x0020 /* Ignore expired keys (intention to replace) */ #define KEYRING_SEARCH_RECURSE 0x0040 /* Search child keyrings also */ int (*iterator)(const void *object, void *iterator_data); /* Internal stuff */ int skipped_ret; bool possessed; key_ref_t result; time64_t now; }; extern bool key_default_cmp(const struct key *key, const struct key_match_data *match_data); extern key_ref_t keyring_search_rcu(key_ref_t keyring_ref, struct keyring_search_context *ctx); extern key_ref_t search_cred_keyrings_rcu(struct keyring_search_context *ctx); extern key_ref_t search_process_keyrings_rcu(struct keyring_search_context *ctx); extern struct key *find_keyring_by_name(const char *name, bool uid_keyring); extern int look_up_user_keyrings(struct key **, struct key **); extern struct key *get_user_session_keyring_rcu(const struct cred *); extern int install_thread_keyring_to_cred(struct cred *); extern int install_process_keyring_to_cred(struct cred *); extern int install_session_keyring_to_cred(struct cred *, struct key *); extern struct key *request_key_and_link(struct key_type *type, const char *description, struct key_tag *domain_tag, const void *callout_info, size_t callout_len, void *aux, struct key *dest_keyring, unsigned long flags); extern bool lookup_user_key_possessed(const struct key *key, const struct key_match_data *match_data); extern long join_session_keyring(const char *name); extern void key_change_session_keyring(struct callback_head *twork); extern struct work_struct key_gc_work; extern unsigned key_gc_delay; extern void keyring_gc(struct key *keyring, time64_t limit); extern void keyring_restriction_gc(struct key *keyring, struct key_type *dead_type); void key_set_expiry(struct key *key, time64_t expiry); extern void key_schedule_gc(time64_t gc_at); extern void key_schedule_gc_links(void); extern void key_gc_keytype(struct key_type *ktype); extern int key_task_permission(const key_ref_t key_ref, const struct cred *cred, enum key_need_perm need_perm); static inline void notify_key(struct key *key, enum key_notification_subtype subtype, u32 aux) { #ifdef CONFIG_KEY_NOTIFICATIONS struct key_notification n = { .watch.type = WATCH_TYPE_KEY_NOTIFY, .watch.subtype = subtype, .watch.info = watch_sizeof(n), .key_id = key_serial(key), .aux = aux, }; post_watch_notification(key->watchers, &n.watch, current_cred(), n.key_id); #endif } /* * Check to see whether permission is granted to use a key in the desired way. */ static inline int key_permission(const key_ref_t key_ref, enum key_need_perm need_perm) { return key_task_permission(key_ref, current_cred(), need_perm); } extern struct key_type key_type_request_key_auth; extern struct key *request_key_auth_new(struct key *target, const char *op, const void *callout_info, size_t callout_len, struct key *dest_keyring); extern struct key *key_get_instantiation_authkey(key_serial_t target_id); /* * Determine whether a key is dead. */ static inline bool key_is_dead(const struct key *key, time64_t limit) { time64_t expiry = key->expiry; if (expiry != TIME64_MAX) { if (!(key->type->flags & KEY_TYPE_INSTANT_REAP)) expiry += key_gc_delay; if (expiry <= limit) return true; } return key->flags & ((1 << KEY_FLAG_DEAD) | (1 << KEY_FLAG_INVALIDATED)) || key->domain_tag->removed; } /* * keyctl() functions */ extern long keyctl_get_keyring_ID(key_serial_t, int); extern long keyctl_join_session_keyring(const char __user *); extern long keyctl_update_key(key_serial_t, const void __user *, size_t); extern long keyctl_revoke_key(key_serial_t); extern long keyctl_keyring_clear(key_serial_t); extern long keyctl_keyring_link(key_serial_t, key_serial_t); extern long keyctl_keyring_move(key_serial_t, key_serial_t, key_serial_t, unsigned int); extern long keyctl_keyring_unlink(key_serial_t, key_serial_t); extern long keyctl_describe_key(key_serial_t, char __user *, size_t); extern long keyctl_keyring_search(key_serial_t, const char __user *, const char __user *, key_serial_t); extern long keyctl_read_key(key_serial_t, char __user *, size_t); extern long keyctl_chown_key(key_serial_t, uid_t, gid_t); extern long keyctl_setperm_key(key_serial_t, key_perm_t); extern long keyctl_instantiate_key(key_serial_t, const void __user *, size_t, key_serial_t); extern long keyctl_negate_key(key_serial_t, unsigned, key_serial_t); extern long keyctl_set_reqkey_keyring(int); extern long keyctl_set_timeout(key_serial_t, unsigned); extern long keyctl_assume_authority(key_serial_t); extern long keyctl_get_security(key_serial_t keyid, char __user *buffer, size_t buflen); extern long keyctl_session_to_parent(void); extern long keyctl_reject_key(key_serial_t, unsigned, unsigned, key_serial_t); extern long keyctl_instantiate_key_iov(key_serial_t, const struct iovec __user *, unsigned, key_serial_t); extern long keyctl_invalidate_key(key_serial_t); extern long keyctl_restrict_keyring(key_serial_t id, const char __user *_type, const char __user *_restriction); #ifdef CONFIG_PERSISTENT_KEYRINGS extern long keyctl_get_persistent(uid_t, key_serial_t); extern unsigned persistent_keyring_expiry; #else static inline long keyctl_get_persistent(uid_t uid, key_serial_t destring) { return -EOPNOTSUPP; } #endif #ifdef CONFIG_KEY_DH_OPERATIONS extern long keyctl_dh_compute(struct keyctl_dh_params __user *, char __user *, size_t, struct keyctl_kdf_params __user *); extern long __keyctl_dh_compute(struct keyctl_dh_params __user *, char __user *, size_t, struct keyctl_kdf_params *); #ifdef CONFIG_COMPAT extern long compat_keyctl_dh_compute(struct keyctl_dh_params __user *params, char __user *buffer, size_t buflen, struct compat_keyctl_kdf_params __user *kdf); #endif #define KEYCTL_KDF_MAX_OUTPUT_LEN 1024 /* max length of KDF output */ #define KEYCTL_KDF_MAX_OI_LEN 64 /* max length of otherinfo */ #else static inline long keyctl_dh_compute(struct keyctl_dh_params __user *params, char __user *buffer, size_t buflen, struct keyctl_kdf_params __user *kdf) { return -EOPNOTSUPP; } #ifdef CONFIG_COMPAT static inline long compat_keyctl_dh_compute( struct keyctl_dh_params __user *params, char __user *buffer, size_t buflen, struct keyctl_kdf_params __user *kdf) { return -EOPNOTSUPP; } #endif #endif #ifdef CONFIG_ASYMMETRIC_KEY_TYPE extern long keyctl_pkey_query(key_serial_t, const char __user *, struct keyctl_pkey_query __user *); extern long keyctl_pkey_verify(const struct keyctl_pkey_params __user *, const char __user *, const void __user *, const void __user *); extern long keyctl_pkey_e_d_s(int, const struct keyctl_pkey_params __user *, const char __user *, const void __user *, void __user *); #else static inline long keyctl_pkey_query(key_serial_t id, const char __user *_info, struct keyctl_pkey_query __user *_res) { return -EOPNOTSUPP; } static inline long keyctl_pkey_verify(const struct keyctl_pkey_params __user *params, const char __user *_info, const void __user *_in, const void __user *_in2) { return -EOPNOTSUPP; } static inline long keyctl_pkey_e_d_s(int op, const struct keyctl_pkey_params __user *params, const char __user *_info, const void __user *_in, void __user *_out) { return -EOPNOTSUPP; } #endif extern long keyctl_capabilities(unsigned char __user *_buffer, size_t buflen); #ifdef CONFIG_KEY_NOTIFICATIONS extern long keyctl_watch_key(key_serial_t, int, int); #else static inline long keyctl_watch_key(key_serial_t key_id, int watch_fd, int watch_id) { return -EOPNOTSUPP; } #endif /* * Debugging key validation */ #ifdef KEY_DEBUGGING extern void __key_check(const struct key *); static inline void key_check(const struct key *key) { if (key && (IS_ERR(key) || key->magic != KEY_DEBUG_MAGIC)) __key_check(key); } #else #define key_check(key) do {} while(0) #endif #endif /* _INTERNAL_H */ |
| 168 168 169 169 9 85 3 3 3 83 2 2 2 2 2 6 1 4 1 1 1 1 1 1 1 6 6 3 1 1 1 1 1 1 6 9 7 6 4 4 6 9 9 9 4 4 3 2 406 406 1 406 406 19 84 83 84 84 84 83 84 83 1 83 83 83 85 83 82 85 2 83 84 84 1 1 1 1 83 83 84 82 1 1 1 83 84 84 83 84 83 3 3 3 3 3 3 80 79 81 81 82 80 82 3 80 81 8 8 8 8 8 8 71 72 72 71 71 70 1 8 8 8 1 1 1 2 1 2 2 2 2 83 83 83 82 84 84 83 82 84 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 | // SPDX-License-Identifier: GPL-2.0-or-later /* audit.c -- Auditing support * Gateway between the kernel (e.g., selinux) and the user-space audit daemon. * System-call specific features have moved to auditsc.c * * Copyright 2003-2007 Red Hat Inc., Durham, North Carolina. * All Rights Reserved. * * Written by Rickard E. (Rik) Faith <faith@redhat.com> * * Goals: 1) Integrate fully with Security Modules. * 2) Minimal run-time overhead: * a) Minimal when syscall auditing is disabled (audit_enable=0). * b) Small when syscall auditing is enabled and no audit record * is generated (defer as much work as possible to record * generation time): * i) context is allocated, * ii) names from getname are stored without a copy, and * iii) inode information stored from path_lookup. * 3) Ability to disable syscall auditing at boot time (audit=0). * 4) Usable by other parts of the kernel (if audit_log* is called, * then a syscall record will be generated automatically for the * current syscall). * 5) Netlink interface to user-space. * 6) Support low-overhead kernel-based filtering to minimize the * information that must be passed to user-space. * * Audit userspace, documentation, tests, and bug/issue trackers: * https://github.com/linux-audit */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/file.h> #include <linux/init.h> #include <linux/types.h> #include <linux/atomic.h> #include <linux/mm.h> #include <linux/export.h> #include <linux/slab.h> #include <linux/err.h> #include <linux/kthread.h> #include <linux/kernel.h> #include <linux/syscalls.h> #include <linux/spinlock.h> #include <linux/rcupdate.h> #include <linux/mutex.h> #include <linux/gfp.h> #include <linux/pid.h> #include <linux/audit.h> #include <net/sock.h> #include <net/netlink.h> #include <linux/skbuff.h> #include <linux/security.h> #include <linux/lsm_hooks.h> #include <linux/freezer.h> #include <linux/pid_namespace.h> #include <net/netns/generic.h> #include "audit.h" /* No auditing will take place until audit_initialized == AUDIT_INITIALIZED. * (Initialization happens after skb_init is called.) */ #define AUDIT_DISABLED -1 #define AUDIT_UNINITIALIZED 0 #define AUDIT_INITIALIZED 1 static int audit_initialized = AUDIT_UNINITIALIZED; u32 audit_enabled = AUDIT_OFF; bool audit_ever_enabled = !!AUDIT_OFF; EXPORT_SYMBOL_GPL(audit_enabled); /* Default state when kernel boots without any parameters. */ static u32 audit_default = AUDIT_OFF; /* If auditing cannot proceed, audit_failure selects what happens. */ static u32 audit_failure = AUDIT_FAIL_PRINTK; /* private audit network namespace index */ static unsigned int audit_net_id; /* Number of modules that provide a security context. List of lsms that provide a security context */ static u32 audit_subj_secctx_cnt; static u32 audit_obj_secctx_cnt; static const struct lsm_id *audit_subj_lsms[MAX_LSM_COUNT]; static const struct lsm_id *audit_obj_lsms[MAX_LSM_COUNT]; /** * struct audit_net - audit private network namespace data * @sk: communication socket */ struct audit_net { struct sock *sk; }; /** * struct auditd_connection - kernel/auditd connection state * @pid: auditd PID * @portid: netlink portid * @net: the associated network namespace * @rcu: RCU head * * Description: * This struct is RCU protected; you must either hold the RCU lock for reading * or the associated spinlock for writing. */ struct auditd_connection { struct pid *pid; u32 portid; struct net *net; struct rcu_head rcu; }; static struct auditd_connection __rcu *auditd_conn; static DEFINE_SPINLOCK(auditd_conn_lock); /* If audit_rate_limit is non-zero, limit the rate of sending audit records * to that number per second. This prevents DoS attacks, but results in * audit records being dropped. */ static u32 audit_rate_limit; /* Number of outstanding audit_buffers allowed. * When set to zero, this means unlimited. */ static u32 audit_backlog_limit = 64; #define AUDIT_BACKLOG_WAIT_TIME (60 * HZ) static u32 audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME; /* The identity of the user shutting down the audit system. */ static kuid_t audit_sig_uid = INVALID_UID; static pid_t audit_sig_pid = -1; static struct lsm_prop audit_sig_lsm; /* Records can be lost in several ways: 0) [suppressed in audit_alloc] 1) out of memory in audit_log_start [kmalloc of struct audit_buffer] 2) out of memory in audit_log_move [alloc_skb] 3) suppressed due to audit_rate_limit 4) suppressed due to audit_backlog_limit */ static atomic_t audit_lost = ATOMIC_INIT(0); /* Monotonically increasing sum of time the kernel has spent * waiting while the backlog limit is exceeded. */ static atomic_t audit_backlog_wait_time_actual = ATOMIC_INIT(0); /* Hash for inode-based rules */ struct list_head audit_inode_hash[AUDIT_INODE_BUCKETS]; static struct kmem_cache *audit_buffer_cache; /* queue msgs to send via kauditd_task */ static struct sk_buff_head audit_queue; /* queue msgs due to temporary unicast send problems */ static struct sk_buff_head audit_retry_queue; /* queue msgs waiting for new auditd connection */ static struct sk_buff_head audit_hold_queue; /* queue servicing thread */ static struct task_struct *kauditd_task; static DECLARE_WAIT_QUEUE_HEAD(kauditd_wait); /* waitqueue for callers who are blocked on the audit backlog */ static DECLARE_WAIT_QUEUE_HEAD(audit_backlog_wait); static struct audit_features af = {.vers = AUDIT_FEATURE_VERSION, .mask = -1, .features = 0, .lock = 0,}; static char *audit_feature_names[2] = { "only_unset_loginuid", "loginuid_immutable", }; /** * struct audit_ctl_mutex - serialize requests from userspace * @lock: the mutex used for locking * @owner: the task which owns the lock * * Description: * This is the lock struct used to ensure we only process userspace requests * in an orderly fashion. We can't simply use a mutex/lock here because we * need to track lock ownership so we don't end up blocking the lock owner in * audit_log_start() or similar. */ static struct audit_ctl_mutex { struct mutex lock; void *owner; } audit_cmd_mutex; /* AUDIT_BUFSIZ is the size of the temporary buffer used for formatting * audit records. Since printk uses a 1024 byte buffer, this buffer * should be at least that large. */ #define AUDIT_BUFSIZ 1024 /* The audit_buffer is used when formatting an audit record. The caller * locks briefly to get the record off the freelist or to allocate the * buffer, and locks briefly to send the buffer to the netlink layer or * to place it on a transmit queue. Multiple audit_buffers can be in * use simultaneously. */ struct audit_buffer { struct sk_buff *skb; /* the skb for audit_log functions */ struct sk_buff_head skb_list; /* formatted skbs, ready to send */ struct audit_context *ctx; /* NULL or associated context */ struct audit_stamp stamp; /* audit stamp for these records */ gfp_t gfp_mask; }; struct audit_reply { __u32 portid; struct net *net; struct sk_buff *skb; }; /** * auditd_test_task - Check to see if a given task is an audit daemon * @task: the task to check * * Description: * Return 1 if the task is a registered audit daemon, 0 otherwise. */ int auditd_test_task(struct task_struct *task) { int rc; struct auditd_connection *ac; rcu_read_lock(); ac = rcu_dereference(auditd_conn); rc = (ac && ac->pid == task_tgid(task) ? 1 : 0); rcu_read_unlock(); return rc; } /** * audit_ctl_lock - Take the audit control lock */ void audit_ctl_lock(void) { mutex_lock(&audit_cmd_mutex.lock); audit_cmd_mutex.owner = current; } /** * audit_ctl_unlock - Drop the audit control lock */ void audit_ctl_unlock(void) { audit_cmd_mutex.owner = NULL; mutex_unlock(&audit_cmd_mutex.lock); } /** * audit_ctl_owner_current - Test to see if the current task owns the lock * * Description: * Return true if the current task owns the audit control lock, false if it * doesn't own the lock. */ static bool audit_ctl_owner_current(void) { return (current == audit_cmd_mutex.owner); } /** * auditd_pid_vnr - Return the auditd PID relative to the namespace * * Description: * Returns the PID in relation to the namespace, 0 on failure. */ static pid_t auditd_pid_vnr(void) { pid_t pid; const struct auditd_connection *ac; rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac || !ac->pid) pid = 0; else pid = pid_vnr(ac->pid); rcu_read_unlock(); return pid; } /** * audit_cfg_lsm - Identify a security module as providing a secctx. * @lsmid: LSM identity * @flags: which contexts are provided * * Description: * Increments the count of the security modules providing a secctx. * If the LSM id is already in the list leave it alone. */ void audit_cfg_lsm(const struct lsm_id *lsmid, int flags) { int i; if (flags & AUDIT_CFG_LSM_SECCTX_SUBJECT) { for (i = 0 ; i < audit_subj_secctx_cnt; i++) if (audit_subj_lsms[i] == lsmid) return; audit_subj_lsms[audit_subj_secctx_cnt++] = lsmid; } if (flags & AUDIT_CFG_LSM_SECCTX_OBJECT) { for (i = 0 ; i < audit_obj_secctx_cnt; i++) if (audit_obj_lsms[i] == lsmid) return; audit_obj_lsms[audit_obj_secctx_cnt++] = lsmid; } } /** * audit_get_sk - Return the audit socket for the given network namespace * @net: the destination network namespace * * Description: * Returns the sock pointer if valid, NULL otherwise. The caller must ensure * that a reference is held for the network namespace while the sock is in use. */ static struct sock *audit_get_sk(const struct net *net) { struct audit_net *aunet; if (!net) return NULL; aunet = net_generic(net, audit_net_id); return aunet->sk; } void audit_panic(const char *message) { switch (audit_failure) { case AUDIT_FAIL_SILENT: break; case AUDIT_FAIL_PRINTK: if (printk_ratelimit()) pr_err("%s\n", message); break; case AUDIT_FAIL_PANIC: panic("audit: %s\n", message); break; } } static inline int audit_rate_check(void) { static unsigned long last_check = 0; static int messages = 0; static DEFINE_SPINLOCK(lock); unsigned long flags; unsigned long now; int retval = 0; if (!audit_rate_limit) return 1; spin_lock_irqsave(&lock, flags); if (++messages < audit_rate_limit) { retval = 1; } else { now = jiffies; if (time_after(now, last_check + HZ)) { last_check = now; messages = 0; retval = 1; } } spin_unlock_irqrestore(&lock, flags); return retval; } /** * audit_log_lost - conditionally log lost audit message event * @message: the message stating reason for lost audit message * * Emit at least 1 message per second, even if audit_rate_check is * throttling. * Always increment the lost messages counter. */ void audit_log_lost(const char *message) { static unsigned long last_msg = 0; static DEFINE_SPINLOCK(lock); unsigned long flags; unsigned long now; int print; atomic_inc(&audit_lost); print = (audit_failure == AUDIT_FAIL_PANIC || !audit_rate_limit); if (!print) { spin_lock_irqsave(&lock, flags); now = jiffies; if (time_after(now, last_msg + HZ)) { print = 1; last_msg = now; } spin_unlock_irqrestore(&lock, flags); } if (print) { if (printk_ratelimit()) pr_warn("audit_lost=%u audit_rate_limit=%u audit_backlog_limit=%u\n", atomic_read(&audit_lost), audit_rate_limit, audit_backlog_limit); audit_panic(message); } } static int audit_log_config_change(char *function_name, u32 new, u32 old, int allow_changes) { struct audit_buffer *ab; int rc = 0; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return rc; audit_log_format(ab, "op=set %s=%u old=%u ", function_name, new, old); audit_log_session_info(ab); rc = audit_log_task_context(ab); if (rc) allow_changes = 0; /* Something weird, deny request */ audit_log_format(ab, " res=%d", allow_changes); audit_log_end(ab); return rc; } static int audit_do_config_change(char *function_name, u32 *to_change, u32 new) { int allow_changes, rc = 0; u32 old = *to_change; /* check if we are locked */ if (audit_enabled == AUDIT_LOCKED) allow_changes = 0; else allow_changes = 1; if (audit_enabled != AUDIT_OFF) { rc = audit_log_config_change(function_name, new, old, allow_changes); if (rc) allow_changes = 0; } /* If we are allowed, make the change */ if (allow_changes == 1) *to_change = new; /* Not allowed, update reason */ else if (rc == 0) rc = -EPERM; return rc; } static int audit_set_rate_limit(u32 limit) { return audit_do_config_change("audit_rate_limit", &audit_rate_limit, limit); } static int audit_set_backlog_limit(u32 limit) { return audit_do_config_change("audit_backlog_limit", &audit_backlog_limit, limit); } static int audit_set_backlog_wait_time(u32 timeout) { return audit_do_config_change("audit_backlog_wait_time", &audit_backlog_wait_time, timeout); } static int audit_set_enabled(u32 state) { int rc; if (state > AUDIT_LOCKED) return -EINVAL; rc = audit_do_config_change("audit_enabled", &audit_enabled, state); if (!rc) audit_ever_enabled |= !!state; return rc; } static int audit_set_failure(u32 state) { if (state != AUDIT_FAIL_SILENT && state != AUDIT_FAIL_PRINTK && state != AUDIT_FAIL_PANIC) return -EINVAL; return audit_do_config_change("audit_failure", &audit_failure, state); } /** * auditd_conn_free - RCU helper to release an auditd connection struct * @rcu: RCU head * * Description: * Drop any references inside the auditd connection tracking struct and free * the memory. */ static void auditd_conn_free(struct rcu_head *rcu) { struct auditd_connection *ac; ac = container_of(rcu, struct auditd_connection, rcu); put_pid(ac->pid); put_net(ac->net); kfree(ac); } /** * auditd_set - Set/Reset the auditd connection state * @pid: auditd PID * @portid: auditd netlink portid * @net: auditd network namespace pointer * @skb: the netlink command from the audit daemon * @ack: netlink ack flag, cleared if ack'd here * * Description: * This function will obtain and drop network namespace references as * necessary. Returns zero on success, negative values on failure. */ static int auditd_set(struct pid *pid, u32 portid, struct net *net, struct sk_buff *skb, bool *ack) { unsigned long flags; struct auditd_connection *ac_old, *ac_new; struct nlmsghdr *nlh; if (!pid || !net) return -EINVAL; ac_new = kzalloc(sizeof(*ac_new), GFP_KERNEL); if (!ac_new) return -ENOMEM; ac_new->pid = get_pid(pid); ac_new->portid = portid; ac_new->net = get_net(net); /* send the ack now to avoid a race with the queue backlog */ if (*ack) { nlh = nlmsg_hdr(skb); netlink_ack(skb, nlh, 0, NULL); *ack = false; } spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); rcu_assign_pointer(auditd_conn, ac_new); spin_unlock_irqrestore(&auditd_conn_lock, flags); if (ac_old) call_rcu(&ac_old->rcu, auditd_conn_free); return 0; } /** * kauditd_printk_skb - Print the audit record to the ring buffer * @skb: audit record * * Whatever the reason, this packet may not make it to the auditd connection * so write it via printk so the information isn't completely lost. */ static void kauditd_printk_skb(struct sk_buff *skb) { struct nlmsghdr *nlh = nlmsg_hdr(skb); char *data = nlmsg_data(nlh); if (nlh->nlmsg_type != AUDIT_EOE && printk_ratelimit()) pr_notice("type=%d %s\n", nlh->nlmsg_type, data); } /** * kauditd_rehold_skb - Handle a audit record send failure in the hold queue * @skb: audit record * @error: error code (unused) * * Description: * This should only be used by the kauditd_thread when it fails to flush the * hold queue. */ static void kauditd_rehold_skb(struct sk_buff *skb, __always_unused int error) { /* put the record back in the queue */ skb_queue_tail(&audit_hold_queue, skb); } /** * kauditd_hold_skb - Queue an audit record, waiting for auditd * @skb: audit record * @error: error code * * Description: * Queue the audit record, waiting for an instance of auditd. When this * function is called we haven't given up yet on sending the record, but things * are not looking good. The first thing we want to do is try to write the * record via printk and then see if we want to try and hold on to the record * and queue it, if we have room. If we want to hold on to the record, but we * don't have room, record a record lost message. */ static void kauditd_hold_skb(struct sk_buff *skb, int error) { /* at this point it is uncertain if we will ever send this to auditd so * try to send the message via printk before we go any further */ kauditd_printk_skb(skb); /* can we just silently drop the message? */ if (!audit_default) goto drop; /* the hold queue is only for when the daemon goes away completely, * not -EAGAIN failures; if we are in a -EAGAIN state requeue the * record on the retry queue unless it's full, in which case drop it */ if (error == -EAGAIN) { if (!audit_backlog_limit || skb_queue_len(&audit_retry_queue) < audit_backlog_limit) { skb_queue_tail(&audit_retry_queue, skb); return; } audit_log_lost("kauditd retry queue overflow"); goto drop; } /* if we have room in the hold queue, queue the message */ if (!audit_backlog_limit || skb_queue_len(&audit_hold_queue) < audit_backlog_limit) { skb_queue_tail(&audit_hold_queue, skb); return; } /* we have no other options - drop the message */ audit_log_lost("kauditd hold queue overflow"); drop: kfree_skb(skb); } /** * kauditd_retry_skb - Queue an audit record, attempt to send again to auditd * @skb: audit record * @error: error code (unused) * * Description: * Not as serious as kauditd_hold_skb() as we still have a connected auditd, * but for some reason we are having problems sending it audit records so * queue the given record and attempt to resend. */ static void kauditd_retry_skb(struct sk_buff *skb, __always_unused int error) { if (!audit_backlog_limit || skb_queue_len(&audit_retry_queue) < audit_backlog_limit) { skb_queue_tail(&audit_retry_queue, skb); return; } /* we have to drop the record, send it via printk as a last effort */ kauditd_printk_skb(skb); audit_log_lost("kauditd retry queue overflow"); kfree_skb(skb); } /** * auditd_reset - Disconnect the auditd connection * @ac: auditd connection state * * Description: * Break the auditd/kauditd connection and move all the queued records into the * hold queue in case auditd reconnects. It is important to note that the @ac * pointer should never be dereferenced inside this function as it may be NULL * or invalid, you can only compare the memory address! If @ac is NULL then * the connection will always be reset. */ static void auditd_reset(const struct auditd_connection *ac) { unsigned long flags; struct sk_buff *skb; struct auditd_connection *ac_old; /* if it isn't already broken, break the connection */ spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); if (ac && ac != ac_old) { /* someone already registered a new auditd connection */ spin_unlock_irqrestore(&auditd_conn_lock, flags); return; } rcu_assign_pointer(auditd_conn, NULL); spin_unlock_irqrestore(&auditd_conn_lock, flags); if (ac_old) call_rcu(&ac_old->rcu, auditd_conn_free); /* flush the retry queue to the hold queue, but don't touch the main * queue since we need to process that normally for multicast */ while ((skb = skb_dequeue(&audit_retry_queue))) kauditd_hold_skb(skb, -ECONNREFUSED); } /** * auditd_send_unicast_skb - Send a record via unicast to auditd * @skb: audit record * * Description: * Send a skb to the audit daemon, returns positive/zero values on success and * negative values on failure; in all cases the skb will be consumed by this * function. If the send results in -ECONNREFUSED the connection with auditd * will be reset. This function may sleep so callers should not hold any locks * where this would cause a problem. */ static int auditd_send_unicast_skb(struct sk_buff *skb) { int rc; u32 portid; struct net *net; struct sock *sk; struct auditd_connection *ac; /* NOTE: we can't call netlink_unicast while in the RCU section so * take a reference to the network namespace and grab local * copies of the namespace, the sock, and the portid; the * namespace and sock aren't going to go away while we hold a * reference and if the portid does become invalid after the RCU * section netlink_unicast() should safely return an error */ rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac) { rcu_read_unlock(); kfree_skb(skb); rc = -ECONNREFUSED; goto err; } net = get_net(ac->net); sk = audit_get_sk(net); portid = ac->portid; rcu_read_unlock(); rc = netlink_unicast(sk, skb, portid, 0); put_net(net); if (rc < 0) goto err; return rc; err: if (ac && rc == -ECONNREFUSED) auditd_reset(ac); return rc; } /** * kauditd_send_queue - Helper for kauditd_thread to flush skb queues * @sk: the sending sock * @portid: the netlink destination * @queue: the skb queue to process * @retry_limit: limit on number of netlink unicast failures * @skb_hook: per-skb hook for additional processing * @err_hook: hook called if the skb fails the netlink unicast send * * Description: * Run through the given queue and attempt to send the audit records to auditd, * returns zero on success, negative values on failure. It is up to the caller * to ensure that the @sk is valid for the duration of this function. * */ static int kauditd_send_queue(struct sock *sk, u32 portid, struct sk_buff_head *queue, unsigned int retry_limit, void (*skb_hook)(struct sk_buff *skb), void (*err_hook)(struct sk_buff *skb, int error)) { int rc = 0; struct sk_buff *skb = NULL; struct sk_buff *skb_tail; unsigned int failed = 0; /* NOTE: kauditd_thread takes care of all our locking, we just use * the netlink info passed to us (e.g. sk and portid) */ skb_tail = skb_peek_tail(queue); while ((skb != skb_tail) && (skb = skb_dequeue(queue))) { /* call the skb_hook for each skb we touch */ if (skb_hook) (*skb_hook)(skb); /* can we send to anyone via unicast? */ if (!sk) { if (err_hook) (*err_hook)(skb, -ECONNREFUSED); continue; } retry: /* grab an extra skb reference in case of error */ skb_get(skb); rc = netlink_unicast(sk, skb, portid, 0); if (rc < 0) { /* send failed - try a few times unless fatal error */ if (++failed >= retry_limit || rc == -ECONNREFUSED || rc == -EPERM) { sk = NULL; if (err_hook) (*err_hook)(skb, rc); if (rc == -EAGAIN) rc = 0; /* continue to drain the queue */ continue; } else goto retry; } else { /* skb sent - drop the extra reference and continue */ consume_skb(skb); failed = 0; } } return (rc >= 0 ? 0 : rc); } /* * kauditd_send_multicast_skb - Send a record to any multicast listeners * @skb: audit record * * Description: * Write a multicast message to anyone listening in the initial network * namespace. This function doesn't consume an skb as might be expected since * it has to copy it anyways. */ static void kauditd_send_multicast_skb(struct sk_buff *skb) { struct sk_buff *copy; struct sock *sock = audit_get_sk(&init_net); struct nlmsghdr *nlh; /* NOTE: we are not taking an additional reference for init_net since * we don't have to worry about it going away */ if (!netlink_has_listeners(sock, AUDIT_NLGRP_READLOG)) return; /* * The seemingly wasteful skb_copy() rather than bumping the refcount * using skb_get() is necessary because non-standard mods are made to * the skb by the original kaudit unicast socket send routine. The * existing auditd daemon assumes this breakage. Fixing this would * require co-ordinating a change in the established protocol between * the kaudit kernel subsystem and the auditd userspace code. There is * no reason for new multicast clients to continue with this * non-compliance. */ copy = skb_copy(skb, GFP_KERNEL); if (!copy) return; nlh = nlmsg_hdr(copy); nlh->nlmsg_len = skb->len; nlmsg_multicast(sock, copy, 0, AUDIT_NLGRP_READLOG, GFP_KERNEL); } /** * kauditd_thread - Worker thread to send audit records to userspace * @dummy: unused */ static int kauditd_thread(void *dummy) { int rc; u32 portid = 0; struct net *net = NULL; struct sock *sk = NULL; struct auditd_connection *ac; #define UNICAST_RETRIES 5 set_freezable(); while (!kthread_should_stop()) { /* NOTE: see the lock comments in auditd_send_unicast_skb() */ rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac) { rcu_read_unlock(); goto main_queue; } net = get_net(ac->net); sk = audit_get_sk(net); portid = ac->portid; rcu_read_unlock(); /* attempt to flush the hold queue */ rc = kauditd_send_queue(sk, portid, &audit_hold_queue, UNICAST_RETRIES, NULL, kauditd_rehold_skb); if (rc < 0) { sk = NULL; auditd_reset(ac); goto main_queue; } /* attempt to flush the retry queue */ rc = kauditd_send_queue(sk, portid, &audit_retry_queue, UNICAST_RETRIES, NULL, kauditd_hold_skb); if (rc < 0) { sk = NULL; auditd_reset(ac); goto main_queue; } main_queue: /* process the main queue - do the multicast send and attempt * unicast, dump failed record sends to the retry queue; if * sk == NULL due to previous failures we will just do the * multicast send and move the record to the hold queue */ rc = kauditd_send_queue(sk, portid, &audit_queue, 1, kauditd_send_multicast_skb, (sk ? kauditd_retry_skb : kauditd_hold_skb)); if (ac && rc < 0) auditd_reset(ac); sk = NULL; /* drop our netns reference, no auditd sends past this line */ if (net) { put_net(net); net = NULL; } /* we have processed all the queues so wake everyone */ wake_up(&audit_backlog_wait); /* NOTE: we want to wake up if there is anything on the queue, * regardless of if an auditd is connected, as we need to * do the multicast send and rotate records from the * main queue to the retry/hold queues */ wait_event_freezable(kauditd_wait, (skb_queue_len(&audit_queue) ? 1 : 0)); } return 0; } int audit_send_list_thread(void *_dest) { struct audit_netlink_list *dest = _dest; struct sk_buff *skb; struct sock *sk = audit_get_sk(dest->net); /* wait for parent to finish and send an ACK */ audit_ctl_lock(); audit_ctl_unlock(); while ((skb = __skb_dequeue(&dest->q)) != NULL) netlink_unicast(sk, skb, dest->portid, 0); put_net(dest->net); kfree(dest); return 0; } struct sk_buff *audit_make_reply(int seq, int type, int done, int multi, const void *payload, int size) { struct sk_buff *skb; struct nlmsghdr *nlh; void *data; int flags = multi ? NLM_F_MULTI : 0; int t = done ? NLMSG_DONE : type; skb = nlmsg_new(size, GFP_KERNEL); if (!skb) return NULL; nlh = nlmsg_put(skb, 0, seq, t, size, flags); if (!nlh) goto out_kfree_skb; data = nlmsg_data(nlh); memcpy(data, payload, size); return skb; out_kfree_skb: kfree_skb(skb); return NULL; } static void audit_free_reply(struct audit_reply *reply) { if (!reply) return; kfree_skb(reply->skb); if (reply->net) put_net(reply->net); kfree(reply); } static int audit_send_reply_thread(void *arg) { struct audit_reply *reply = (struct audit_reply *)arg; audit_ctl_lock(); audit_ctl_unlock(); /* Ignore failure. It'll only happen if the sender goes away, because our timeout is set to infinite. */ netlink_unicast(audit_get_sk(reply->net), reply->skb, reply->portid, 0); reply->skb = NULL; audit_free_reply(reply); return 0; } /** * audit_send_reply - send an audit reply message via netlink * @request_skb: skb of request we are replying to (used to target the reply) * @seq: sequence number * @type: audit message type * @done: done (last) flag * @multi: multi-part message flag * @payload: payload data * @size: payload size * * Allocates a skb, builds the netlink message, and sends it to the port id. */ static void audit_send_reply(struct sk_buff *request_skb, int seq, int type, int done, int multi, const void *payload, int size) { struct task_struct *tsk; struct audit_reply *reply; reply = kzalloc(sizeof(*reply), GFP_KERNEL); if (!reply) return; reply->skb = audit_make_reply(seq, type, done, multi, payload, size); if (!reply->skb) goto err; reply->net = get_net(sock_net(NETLINK_CB(request_skb).sk)); reply->portid = NETLINK_CB(request_skb).portid; tsk = kthread_run(audit_send_reply_thread, reply, "audit_send_reply"); if (IS_ERR(tsk)) goto err; return; err: audit_free_reply(reply); } /* * Check for appropriate CAP_AUDIT_ capabilities on incoming audit * control messages. */ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type) { int err = 0; /* Only support initial user namespace for now. */ /* * We return ECONNREFUSED because it tricks userspace into thinking * that audit was not configured into the kernel. Lots of users * configure their PAM stack (because that's what the distro does) * to reject login if unable to send messages to audit. If we return * ECONNREFUSED the PAM stack thinks the kernel does not have audit * configured in and will let login proceed. If we return EPERM * userspace will reject all logins. This should be removed when we * support non init namespaces!! */ if (current_user_ns() != &init_user_ns) return -ECONNREFUSED; switch (msg_type) { case AUDIT_LIST: case AUDIT_ADD: case AUDIT_DEL: return -EOPNOTSUPP; case AUDIT_GET: case AUDIT_SET: case AUDIT_GET_FEATURE: case AUDIT_SET_FEATURE: case AUDIT_LIST_RULES: case AUDIT_ADD_RULE: case AUDIT_DEL_RULE: case AUDIT_SIGNAL_INFO: case AUDIT_TTY_GET: case AUDIT_TTY_SET: case AUDIT_TRIM: case AUDIT_MAKE_EQUIV: /* Only support auditd and auditctl in initial pid namespace * for now. */ if (task_active_pid_ns(current) != &init_pid_ns) return -EPERM; if (!netlink_capable(skb, CAP_AUDIT_CONTROL)) err = -EPERM; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!netlink_capable(skb, CAP_AUDIT_WRITE)) err = -EPERM; break; default: /* bad msg */ err = -EINVAL; } return err; } static void audit_log_common_recv_msg(struct audit_context *context, struct audit_buffer **ab, u16 msg_type) { uid_t uid = from_kuid(&init_user_ns, current_uid()); pid_t pid = task_tgid_nr(current); if (!audit_enabled && msg_type != AUDIT_USER_AVC) { *ab = NULL; return; } *ab = audit_log_start(context, GFP_KERNEL, msg_type); if (unlikely(!*ab)) return; audit_log_format(*ab, "pid=%d uid=%u ", pid, uid); audit_log_session_info(*ab); audit_log_task_context(*ab); } static inline void audit_log_user_recv_msg(struct audit_buffer **ab, u16 msg_type) { audit_log_common_recv_msg(NULL, ab, msg_type); } static int is_audit_feature_set(int i) { return af.features & AUDIT_FEATURE_TO_MASK(i); } static int audit_get_feature(struct sk_buff *skb) { u32 seq; seq = nlmsg_hdr(skb)->nlmsg_seq; audit_send_reply(skb, seq, AUDIT_GET_FEATURE, 0, 0, &af, sizeof(af)); return 0; } static void audit_log_feature_change(int which, u32 old_feature, u32 new_feature, u32 old_lock, u32 new_lock, int res) { struct audit_buffer *ab; if (audit_enabled == AUDIT_OFF) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_FEATURE_CHANGE); if (!ab) return; audit_log_task_info(ab); audit_log_format(ab, " feature=%s old=%u new=%u old_lock=%u new_lock=%u res=%d", audit_feature_names[which], !!old_feature, !!new_feature, !!old_lock, !!new_lock, res); audit_log_end(ab); } static int audit_set_feature(struct audit_features *uaf) { int i; BUILD_BUG_ON(AUDIT_LAST_FEATURE + 1 > ARRAY_SIZE(audit_feature_names)); /* if there is ever a version 2 we should handle that here */ for (i = 0; i <= AUDIT_LAST_FEATURE; i++) { u32 feature = AUDIT_FEATURE_TO_MASK(i); u32 old_feature, new_feature, old_lock, new_lock; /* if we are not changing this feature, move along */ if (!(feature & uaf->mask)) continue; old_feature = af.features & feature; new_feature = uaf->features & feature; new_lock = (uaf->lock | af.lock) & feature; old_lock = af.lock & feature; /* are we changing a locked feature? */ if (old_lock && (new_feature != old_feature)) { audit_log_feature_change(i, old_feature, new_feature, old_lock, new_lock, 0); return -EPERM; } } /* nothing invalid, do the changes */ for (i = 0; i <= AUDIT_LAST_FEATURE; i++) { u32 feature = AUDIT_FEATURE_TO_MASK(i); u32 old_feature, new_feature, old_lock, new_lock; /* if we are not changing this feature, move along */ if (!(feature & uaf->mask)) continue; old_feature = af.features & feature; new_feature = uaf->features & feature; old_lock = af.lock & feature; new_lock = (uaf->lock | af.lock) & feature; if (new_feature != old_feature) audit_log_feature_change(i, old_feature, new_feature, old_lock, new_lock, 1); if (new_feature) af.features |= feature; else af.features &= ~feature; af.lock |= new_lock; } return 0; } static int audit_replace(struct pid *pid) { pid_t pvnr; struct sk_buff *skb; pvnr = pid_vnr(pid); skb = audit_make_reply(0, AUDIT_REPLACE, 0, 0, &pvnr, sizeof(pvnr)); if (!skb) return -ENOMEM; return auditd_send_unicast_skb(skb); } static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh, bool *ack) { u32 seq; void *data; int data_len; int err; struct audit_buffer *ab; u16 msg_type = nlh->nlmsg_type; struct audit_sig_info *sig_data; struct lsm_context lsmctx = { NULL, 0, 0 }; err = audit_netlink_ok(skb, msg_type); if (err) return err; seq = nlh->nlmsg_seq; data = nlmsg_data(nlh); data_len = nlmsg_len(nlh); switch (msg_type) { case AUDIT_GET: { struct audit_status s; memset(&s, 0, sizeof(s)); s.enabled = audit_enabled; s.failure = audit_failure; /* NOTE: use pid_vnr() so the PID is relative to the current * namespace */ s.pid = auditd_pid_vnr(); s.rate_limit = audit_rate_limit; s.backlog_limit = audit_backlog_limit; s.lost = atomic_read(&audit_lost); s.backlog = skb_queue_len(&audit_queue); s.feature_bitmap = AUDIT_FEATURE_BITMAP_ALL; s.backlog_wait_time = audit_backlog_wait_time; s.backlog_wait_time_actual = atomic_read(&audit_backlog_wait_time_actual); audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s)); break; } case AUDIT_SET: { struct audit_status s; memset(&s, 0, sizeof(s)); /* guard against past and future API changes */ memcpy(&s, data, min_t(size_t, sizeof(s), data_len)); if (s.mask & AUDIT_STATUS_ENABLED) { err = audit_set_enabled(s.enabled); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_FAILURE) { err = audit_set_failure(s.failure); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_PID) { /* NOTE: we are using the vnr PID functions below * because the s.pid value is relative to the * namespace of the caller; at present this * doesn't matter much since you can really only * run auditd from the initial pid namespace, but * something to keep in mind if this changes */ pid_t new_pid = s.pid; pid_t auditd_pid; struct pid *req_pid = task_tgid(current); /* Sanity check - PID values must match. Setting * pid to 0 is how auditd ends auditing. */ if (new_pid && (new_pid != pid_vnr(req_pid))) return -EINVAL; /* test the auditd connection */ audit_replace(req_pid); auditd_pid = auditd_pid_vnr(); if (auditd_pid) { /* replacing a healthy auditd is not allowed */ if (new_pid) { audit_log_config_change("audit_pid", new_pid, auditd_pid, 0); return -EEXIST; } /* only current auditd can unregister itself */ if (pid_vnr(req_pid) != auditd_pid) { audit_log_config_change("audit_pid", new_pid, auditd_pid, 0); return -EACCES; } } if (new_pid) { /* register a new auditd connection */ err = auditd_set(req_pid, NETLINK_CB(skb).portid, sock_net(NETLINK_CB(skb).sk), skb, ack); if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, auditd_pid, err ? 0 : 1); if (err) return err; /* try to process any backlog */ wake_up_interruptible(&kauditd_wait); } else { if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, auditd_pid, 1); /* unregister the auditd connection */ auditd_reset(NULL); } } if (s.mask & AUDIT_STATUS_RATE_LIMIT) { err = audit_set_rate_limit(s.rate_limit); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_BACKLOG_LIMIT) { err = audit_set_backlog_limit(s.backlog_limit); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_BACKLOG_WAIT_TIME) { if (sizeof(s) > (size_t)nlh->nlmsg_len) return -EINVAL; if (s.backlog_wait_time > 10*AUDIT_BACKLOG_WAIT_TIME) return -EINVAL; err = audit_set_backlog_wait_time(s.backlog_wait_time); if (err < 0) return err; } if (s.mask == AUDIT_STATUS_LOST) { u32 lost = atomic_xchg(&audit_lost, 0); audit_log_config_change("lost", 0, lost, 1); return lost; } if (s.mask == AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL) { u32 actual = atomic_xchg(&audit_backlog_wait_time_actual, 0); audit_log_config_change("backlog_wait_time_actual", 0, actual, 1); return actual; } break; } case AUDIT_GET_FEATURE: err = audit_get_feature(skb); if (err) return err; break; case AUDIT_SET_FEATURE: if (data_len < sizeof(struct audit_features)) return -EINVAL; err = audit_set_feature(data); if (err) return err; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!audit_enabled && msg_type != AUDIT_USER_AVC) return 0; /* exit early if there isn't at least one character to print */ if (data_len < 2) return -EINVAL; err = audit_filter(msg_type, AUDIT_FILTER_USER); if (err == 1) { /* match or error */ char *str = data; err = 0; if (msg_type == AUDIT_USER_TTY) { err = tty_audit_push(); if (err) break; } audit_log_user_recv_msg(&ab, msg_type); if (msg_type != AUDIT_USER_TTY) { /* ensure NULL termination */ str[data_len - 1] = '\0'; audit_log_format(ab, " msg='%.*s'", AUDIT_MESSAGE_TEXT_MAX, str); } else { audit_log_format(ab, " data="); if (str[data_len - 1] == '\0') data_len--; audit_log_n_untrustedstring(ab, str, data_len); } audit_log_end(ab); } break; case AUDIT_ADD_RULE: case AUDIT_DEL_RULE: if (data_len < sizeof(struct audit_rule_data)) return -EINVAL; if (audit_enabled == AUDIT_LOCKED) { audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=%s audit_enabled=%d res=0", msg_type == AUDIT_ADD_RULE ? "add_rule" : "remove_rule", audit_enabled); audit_log_end(ab); return -EPERM; } err = audit_rule_change(msg_type, seq, data, data_len); break; case AUDIT_LIST_RULES: err = audit_list_rules_send(skb, seq); break; case AUDIT_TRIM: audit_trim_trees(); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=trim res=1"); audit_log_end(ab); break; case AUDIT_MAKE_EQUIV: { void *bufp = data; u32 sizes[2]; size_t msglen = data_len; char *old, *new; err = -EINVAL; if (msglen < 2 * sizeof(u32)) break; memcpy(sizes, bufp, 2 * sizeof(u32)); bufp += 2 * sizeof(u32); msglen -= 2 * sizeof(u32); old = audit_unpack_string(&bufp, &msglen, sizes[0]); if (IS_ERR(old)) { err = PTR_ERR(old); break; } new = audit_unpack_string(&bufp, &msglen, sizes[1]); if (IS_ERR(new)) { err = PTR_ERR(new); kfree(old); break; } /* OK, here comes... */ err = audit_tag_tree(old, new); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=make_equiv old="); audit_log_untrustedstring(ab, old); audit_log_format(ab, " new="); audit_log_untrustedstring(ab, new); audit_log_format(ab, " res=%d", !err); audit_log_end(ab); kfree(old); kfree(new); break; } case AUDIT_SIGNAL_INFO: if (lsmprop_is_set(&audit_sig_lsm)) { err = security_lsmprop_to_secctx(&audit_sig_lsm, &lsmctx, LSM_ID_UNDEF); if (err < 0) return err; } sig_data = kmalloc(struct_size(sig_data, ctx, lsmctx.len), GFP_KERNEL); if (!sig_data) { if (lsmprop_is_set(&audit_sig_lsm)) security_release_secctx(&lsmctx); return -ENOMEM; } sig_data->uid = from_kuid(&init_user_ns, audit_sig_uid); sig_data->pid = audit_sig_pid; if (lsmprop_is_set(&audit_sig_lsm)) { memcpy(sig_data->ctx, lsmctx.context, lsmctx.len); security_release_secctx(&lsmctx); } audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO, 0, 0, sig_data, struct_size(sig_data, ctx, lsmctx.len)); kfree(sig_data); break; case AUDIT_TTY_GET: { struct audit_tty_status s; unsigned int t; t = READ_ONCE(current->signal->audit_tty); s.enabled = t & AUDIT_TTY_ENABLE; s.log_passwd = !!(t & AUDIT_TTY_LOG_PASSWD); audit_send_reply(skb, seq, AUDIT_TTY_GET, 0, 0, &s, sizeof(s)); break; } case AUDIT_TTY_SET: { struct audit_tty_status s, old; struct audit_buffer *ab; unsigned int t; memset(&s, 0, sizeof(s)); /* guard against past and future API changes */ memcpy(&s, data, min_t(size_t, sizeof(s), data_len)); /* check if new data is valid */ if ((s.enabled != 0 && s.enabled != 1) || (s.log_passwd != 0 && s.log_passwd != 1)) err = -EINVAL; if (err) t = READ_ONCE(current->signal->audit_tty); else { t = s.enabled | (-s.log_passwd & AUDIT_TTY_LOG_PASSWD); t = xchg(¤t->signal->audit_tty, t); } old.enabled = t & AUDIT_TTY_ENABLE; old.log_passwd = !!(t & AUDIT_TTY_LOG_PASSWD); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=tty_set old-enabled=%d new-enabled=%d" " old-log_passwd=%d new-log_passwd=%d res=%d", old.enabled, s.enabled, old.log_passwd, s.log_passwd, !err); audit_log_end(ab); break; } default: err = -EINVAL; break; } return err < 0 ? err : 0; } /** * audit_receive - receive messages from a netlink control socket * @skb: the message buffer * * Parse the provided skb and deal with any messages that may be present, * malformed skbs are discarded. */ static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; bool ack; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned */ int len; int err; nlh = nlmsg_hdr(skb); len = skb->len; audit_ctl_lock(); while (nlmsg_ok(nlh, len)) { ack = nlh->nlmsg_flags & NLM_F_ACK; err = audit_receive_msg(skb, nlh, &ack); /* send an ack if the user asked for one and audit_receive_msg * didn't already do it, or if there was an error. */ if (ack || err) netlink_ack(skb, nlh, err, NULL); nlh = nlmsg_next(nlh, &len); } audit_ctl_unlock(); /* can't block with the ctrl lock, so penalize the sender now */ if (audit_backlog_limit && (skb_queue_len(&audit_queue) > audit_backlog_limit)) { DECLARE_WAITQUEUE(wait, current); /* wake kauditd to try and flush the queue */ wake_up_interruptible(&kauditd_wait); add_wait_queue_exclusive(&audit_backlog_wait, &wait); set_current_state(TASK_UNINTERRUPTIBLE); schedule_timeout(audit_backlog_wait_time); remove_wait_queue(&audit_backlog_wait, &wait); } } /* Log information about who is connecting to the audit multicast socket */ static void audit_log_multicast(int group, const char *op, int err) { const struct cred *cred; struct tty_struct *tty; char comm[sizeof(current->comm)]; struct audit_buffer *ab; if (!audit_enabled) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_EVENT_LISTENER); if (!ab) return; cred = current_cred(); tty = audit_get_tty(); audit_log_format(ab, "pid=%u uid=%u auid=%u tty=%s ses=%u", task_tgid_nr(current), from_kuid(&init_user_ns, cred->uid), from_kuid(&init_user_ns, audit_get_loginuid(current)), tty ? tty_name(tty) : "(none)", audit_get_sessionid(current)); audit_put_tty(tty); audit_log_task_context(ab); /* subj= */ audit_log_format(ab, " comm="); audit_log_untrustedstring(ab, get_task_comm(comm, current)); audit_log_d_path_exe(ab, current->mm); /* exe= */ audit_log_format(ab, " nl-mcgrp=%d op=%s res=%d", group, op, !err); audit_log_end(ab); } /* Run custom bind function on netlink socket group connect or bind requests. */ static int audit_multicast_bind(struct net *net, int group) { int err = 0; if (!capable(CAP_AUDIT_READ)) err = -EPERM; audit_log_multicast(group, "connect", err); return err; } static void audit_multicast_unbind(struct net *net, int group) { audit_log_multicast(group, "disconnect", 0); } static int __net_init audit_net_init(struct net *net) { struct netlink_kernel_cfg cfg = { .input = audit_receive, .bind = audit_multicast_bind, .unbind = audit_multicast_unbind, .flags = NL_CFG_F_NONROOT_RECV, .groups = AUDIT_NLGRP_MAX, }; struct audit_net *aunet = net_generic(net, audit_net_id); aunet->sk = netlink_kernel_create(net, NETLINK_AUDIT, &cfg); if (aunet->sk == NULL) { audit_panic("cannot initialize netlink socket in namespace"); return -ENOMEM; } /* limit the timeout in case auditd is blocked/stopped */ aunet->sk->sk_sndtimeo = HZ / 10; return 0; } static void __net_exit audit_net_exit(struct net *net) { struct audit_net *aunet = net_generic(net, audit_net_id); /* NOTE: you would think that we would want to check the auditd * connection and potentially reset it here if it lives in this * namespace, but since the auditd connection tracking struct holds a * reference to this namespace (see auditd_set()) we are only ever * going to get here after that connection has been released */ netlink_kernel_release(aunet->sk); } static struct pernet_operations audit_net_ops __net_initdata = { .init = audit_net_init, .exit = audit_net_exit, .id = &audit_net_id, .size = sizeof(struct audit_net), }; /* Initialize audit support at boot time. */ static int __init audit_init(void) { int i; if (audit_initialized == AUDIT_DISABLED) return 0; audit_buffer_cache = KMEM_CACHE(audit_buffer, SLAB_PANIC); skb_queue_head_init(&audit_queue); skb_queue_head_init(&audit_retry_queue); skb_queue_head_init(&audit_hold_queue); for (i = 0; i < AUDIT_INODE_BUCKETS; i++) INIT_LIST_HEAD(&audit_inode_hash[i]); mutex_init(&audit_cmd_mutex.lock); audit_cmd_mutex.owner = NULL; pr_info("initializing netlink subsys (%s)\n", str_enabled_disabled(audit_default)); register_pernet_subsys(&audit_net_ops); audit_initialized = AUDIT_INITIALIZED; kauditd_task = kthread_run(kauditd_thread, NULL, "kauditd"); if (IS_ERR(kauditd_task)) { int err = PTR_ERR(kauditd_task); panic("audit: failed to start the kauditd thread (%d)\n", err); } audit_log(NULL, GFP_KERNEL, AUDIT_KERNEL, "state=initialized audit_enabled=%u res=1", audit_enabled); return 0; } postcore_initcall(audit_init); /* * Process kernel command-line parameter at boot time. * audit={0|off} or audit={1|on}. */ static int __init audit_enable(char *str) { if (!strcasecmp(str, "off") || !strcmp(str, "0")) audit_default = AUDIT_OFF; else if (!strcasecmp(str, "on") || !strcmp(str, "1")) audit_default = AUDIT_ON; else { pr_err("audit: invalid 'audit' parameter value (%s)\n", str); audit_default = AUDIT_ON; } if (audit_default == AUDIT_OFF) audit_initialized = AUDIT_DISABLED; if (audit_set_enabled(audit_default)) pr_err("audit: error setting audit state (%d)\n", audit_default); pr_info("%s\n", audit_default ? "enabled (after initialization)" : "disabled (until reboot)"); return 1; } __setup("audit=", audit_enable); /* Process kernel command-line parameter at boot time. * audit_backlog_limit=<n> */ static int __init audit_backlog_limit_set(char *str) { u32 audit_backlog_limit_arg; pr_info("audit_backlog_limit: "); if (kstrtouint(str, 0, &audit_backlog_limit_arg)) { pr_cont("using default of %u, unable to parse %s\n", audit_backlog_limit, str); return 1; } audit_backlog_limit = audit_backlog_limit_arg; pr_cont("%d\n", audit_backlog_limit); return 1; } __setup("audit_backlog_limit=", audit_backlog_limit_set); static void audit_buffer_free(struct audit_buffer *ab) { struct sk_buff *skb; if (!ab) return; while ((skb = skb_dequeue(&ab->skb_list))) kfree_skb(skb); kmem_cache_free(audit_buffer_cache, ab); } static struct audit_buffer *audit_buffer_alloc(struct audit_context *ctx, gfp_t gfp_mask, int type) { struct audit_buffer *ab; ab = kmem_cache_alloc(audit_buffer_cache, gfp_mask); if (!ab) return NULL; skb_queue_head_init(&ab->skb_list); ab->skb = nlmsg_new(AUDIT_BUFSIZ, gfp_mask); if (!ab->skb) goto err; skb_queue_tail(&ab->skb_list, ab->skb); if (!nlmsg_put(ab->skb, 0, 0, type, 0, 0)) goto err; ab->ctx = ctx; ab->gfp_mask = gfp_mask; return ab; err: audit_buffer_free(ab); return NULL; } /** * audit_serial - compute a serial number for the audit record * * Compute a serial number for the audit record. Audit records are * written to user-space as soon as they are generated, so a complete * audit record may be written in several pieces. The timestamp of the * record and this serial number are used by the user-space tools to * determine which pieces belong to the same audit record. The * (timestamp,serial) tuple is unique for each syscall and is live from * syscall entry to syscall exit. * * NOTE: Another possibility is to store the formatted records off the * audit context (for those records that have a context), and emit them * all at syscall exit. However, this could delay the reporting of * significant errors until syscall exit (or never, if the system * halts). */ unsigned int audit_serial(void) { static atomic_t serial = ATOMIC_INIT(0); return atomic_inc_return(&serial); } static inline void audit_get_stamp(struct audit_context *ctx, struct audit_stamp *stamp) { if (!ctx || !auditsc_get_stamp(ctx, stamp)) { ktime_get_coarse_real_ts64(&stamp->ctime); stamp->serial = audit_serial(); } } /** * audit_log_start - obtain an audit buffer * @ctx: audit_context (may be NULL) * @gfp_mask: type of allocation * @type: audit message type * * Returns audit_buffer pointer on success or NULL on error. * * Obtain an audit buffer. This routine does locking to obtain the * audit buffer, but then no locking is required for calls to * audit_log_*format. If the task (ctx) is a task that is currently in a * syscall, then the syscall is marked as auditable and an audit record * will be written at syscall exit. If there is no associated task, then * task context (ctx) should be NULL. */ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, int type) { struct audit_buffer *ab; if (audit_initialized != AUDIT_INITIALIZED) return NULL; if (unlikely(!audit_filter(type, AUDIT_FILTER_EXCLUDE))) return NULL; /* NOTE: don't ever fail/sleep on these two conditions: * 1. auditd generated record - since we need auditd to drain the * queue; also, when we are checking for auditd, compare PIDs using * task_tgid_vnr() since auditd_pid is set in audit_receive_msg() * using a PID anchored in the caller's namespace * 2. generator holding the audit_cmd_mutex - we don't want to block * while holding the mutex, although we do penalize the sender * later in audit_receive() when it is safe to block */ if (!(auditd_test_task(current) || audit_ctl_owner_current())) { long stime = audit_backlog_wait_time; while (audit_backlog_limit && (skb_queue_len(&audit_queue) > audit_backlog_limit)) { /* wake kauditd to try and flush the queue */ wake_up_interruptible(&kauditd_wait); /* sleep if we are allowed and we haven't exhausted our * backlog wait limit */ if (gfpflags_allow_blocking(gfp_mask) && (stime > 0)) { long rtime = stime; DECLARE_WAITQUEUE(wait, current); add_wait_queue_exclusive(&audit_backlog_wait, &wait); set_current_state(TASK_UNINTERRUPTIBLE); stime = schedule_timeout(rtime); atomic_add(rtime - stime, &audit_backlog_wait_time_actual); remove_wait_queue(&audit_backlog_wait, &wait); } else { if (audit_rate_check() && printk_ratelimit()) pr_warn("audit_backlog=%d > audit_backlog_limit=%d\n", skb_queue_len(&audit_queue), audit_backlog_limit); audit_log_lost("backlog limit exceeded"); return NULL; } } } ab = audit_buffer_alloc(ctx, gfp_mask, type); if (!ab) { audit_log_lost("out of memory in audit_log_start"); return NULL; } audit_get_stamp(ab->ctx, &ab->stamp); /* cancel dummy context to enable supporting records */ if (ctx) ctx->dummy = 0; audit_log_format(ab, "audit(%llu.%03lu:%u): ", (unsigned long long)ab->stamp.ctime.tv_sec, ab->stamp.ctime.tv_nsec/1000000, ab->stamp.serial); return ab; } /** * audit_expand - expand skb in the audit buffer * @ab: audit_buffer * @extra: space to add at tail of the skb * * Returns 0 (no space) on failed expansion, or available space if * successful. */ static inline int audit_expand(struct audit_buffer *ab, int extra) { struct sk_buff *skb = ab->skb; int oldtail = skb_tailroom(skb); int ret = pskb_expand_head(skb, 0, extra, ab->gfp_mask); int newtail = skb_tailroom(skb); if (ret < 0) { audit_log_lost("out of memory in audit_expand"); return 0; } skb->truesize += newtail - oldtail; return newtail; } /* * Format an audit message into the audit buffer. If there isn't enough * room in the audit buffer, more room will be allocated and vsnprint * will be called a second time. Currently, we assume that a printk * can't format message larger than 1024 bytes, so we don't either. */ static __printf(2, 0) void audit_log_vformat(struct audit_buffer *ab, const char *fmt, va_list args) { int len, avail; struct sk_buff *skb; va_list args2; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); if (avail == 0) { avail = audit_expand(ab, AUDIT_BUFSIZ); if (!avail) goto out; } va_copy(args2, args); len = vsnprintf(skb_tail_pointer(skb), avail, fmt, args); if (len >= avail) { /* The printk buffer is 1024 bytes long, so if we get * here and AUDIT_BUFSIZ is at least 1024, then we can * log everything that printk could have logged. */ avail = audit_expand(ab, max_t(unsigned, AUDIT_BUFSIZ, 1+len-avail)); if (!avail) goto out_va_end; len = vsnprintf(skb_tail_pointer(skb), avail, fmt, args2); } if (len > 0) skb_put(skb, len); out_va_end: va_end(args2); out: return; } /** * audit_log_format - format a message into the audit buffer. * @ab: audit_buffer * @fmt: format string * @...: optional parameters matching @fmt string * * All the work is done in audit_log_vformat. */ void audit_log_format(struct audit_buffer *ab, const char *fmt, ...) { va_list args; if (!ab) return; va_start(args, fmt); audit_log_vformat(ab, fmt, args); va_end(args); } /** * audit_log_n_hex - convert a buffer to hex and append it to the audit skb * @ab: the audit_buffer * @buf: buffer to convert to hex * @len: length of @buf to be converted * * No return value; failure to expand is silently ignored. * * This function will take the passed buf and convert it into a string of * ascii hex digits. The new string is placed onto the skb. */ void audit_log_n_hex(struct audit_buffer *ab, const unsigned char *buf, size_t len) { int i, avail, new_len; unsigned char *ptr; struct sk_buff *skb; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); new_len = len<<1; if (new_len >= avail) { /* Round the buffer request up to the next multiple */ new_len = AUDIT_BUFSIZ*(((new_len-avail)/AUDIT_BUFSIZ) + 1); avail = audit_expand(ab, new_len); if (!avail) return; } ptr = skb_tail_pointer(skb); for (i = 0; i < len; i++) ptr = hex_byte_pack_upper(ptr, buf[i]); *ptr = 0; skb_put(skb, len << 1); /* new string is twice the old string */ } /* * Format a string of no more than slen characters into the audit buffer, * enclosed in quote marks. */ void audit_log_n_string(struct audit_buffer *ab, const char *string, size_t slen) { int avail, new_len; unsigned char *ptr; struct sk_buff *skb; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); new_len = slen + 3; /* enclosing quotes + null terminator */ if (new_len > avail) { avail = audit_expand(ab, new_len); if (!avail) return; } ptr = skb_tail_pointer(skb); *ptr++ = '"'; memcpy(ptr, string, slen); ptr += slen; *ptr++ = '"'; *ptr = 0; skb_put(skb, slen + 2); /* don't include null terminator */ } /** * audit_string_contains_control - does a string need to be logged in hex * @string: string to be checked * @len: max length of the string to check */ bool audit_string_contains_control(const char *string, size_t len) { const unsigned char *p; for (p = string; p < (const unsigned char *)string + len; p++) { if (*p == '"' || *p < 0x21 || *p > 0x7e) return true; } return false; } /** * audit_log_n_untrustedstring - log a string that may contain random characters * @ab: audit_buffer * @string: string to be logged * @len: length of string (not including trailing null) * * This code will escape a string that is passed to it if the string * contains a control character, unprintable character, double quote mark, * or a space. Unescaped strings will start and end with a double quote mark. * Strings that are escaped are printed in hex (2 digits per char). * * The caller specifies the number of characters in the string to log, which may * or may not be the entire string. */ void audit_log_n_untrustedstring(struct audit_buffer *ab, const char *string, size_t len) { if (audit_string_contains_control(string, len)) audit_log_n_hex(ab, string, len); else audit_log_n_string(ab, string, len); } /** * audit_log_untrustedstring - log a string that may contain random characters * @ab: audit_buffer * @string: string to be logged * * Same as audit_log_n_untrustedstring(), except that strlen is used to * determine string length. */ void audit_log_untrustedstring(struct audit_buffer *ab, const char *string) { audit_log_n_untrustedstring(ab, string, strlen(string)); } /* This is a helper-function to print the escaped d_path */ void audit_log_d_path(struct audit_buffer *ab, const char *prefix, const struct path *path) { char *p, *pathname; if (prefix) audit_log_format(ab, "%s", prefix); /* We will allow 11 spaces for ' (deleted)' to be appended */ pathname = kmalloc(PATH_MAX+11, ab->gfp_mask); if (!pathname) { audit_log_format(ab, "\"<no_memory>\""); return; } p = d_path(path, pathname, PATH_MAX+11); if (IS_ERR(p)) { /* Should never happen since we send PATH_MAX */ /* FIXME: can we save some information here? */ audit_log_format(ab, "\"<too_long>\""); } else audit_log_untrustedstring(ab, p); kfree(pathname); } void audit_log_session_info(struct audit_buffer *ab) { unsigned int sessionid = audit_get_sessionid(current); uid_t auid = from_kuid(&init_user_ns, audit_get_loginuid(current)); audit_log_format(ab, "auid=%u ses=%u", auid, sessionid); } void audit_log_key(struct audit_buffer *ab, char *key) { audit_log_format(ab, " key="); if (key) audit_log_untrustedstring(ab, key); else audit_log_format(ab, "(null)"); } /** * audit_buffer_aux_new - Add an aux record buffer to the skb list * @ab: audit_buffer * @type: message type * * Aux records are allocated and added to the skb list of * the "main" record. The ab->skb is reset to point to the * aux record on its creation. When the aux record in complete * ab->skb has to be reset to point to the "main" record. * This allows the audit_log_ functions to be ignorant of * which kind of record it is logging to. It also avoids adding * special data for aux records. * * On success ab->skb will point to the new aux record. * Returns 0 on success, -ENOMEM should allocation fail. */ static int audit_buffer_aux_new(struct audit_buffer *ab, int type) { WARN_ON(ab->skb != skb_peek(&ab->skb_list)); ab->skb = nlmsg_new(AUDIT_BUFSIZ, ab->gfp_mask); if (!ab->skb) goto err; if (!nlmsg_put(ab->skb, 0, 0, type, 0, 0)) goto err; skb_queue_tail(&ab->skb_list, ab->skb); audit_log_format(ab, "audit(%llu.%03lu:%u): ", (unsigned long long)ab->stamp.ctime.tv_sec, ab->stamp.ctime.tv_nsec/1000000, ab->stamp.serial); return 0; err: kfree_skb(ab->skb); ab->skb = skb_peek(&ab->skb_list); return -ENOMEM; } /** * audit_buffer_aux_end - Switch back to the "main" record from an aux record * @ab: audit_buffer * * Restores the "main" audit record to ab->skb. */ static void audit_buffer_aux_end(struct audit_buffer *ab) { ab->skb = skb_peek(&ab->skb_list); } /** * audit_log_subj_ctx - Add LSM subject information * @ab: audit_buffer * @prop: LSM subject properties. * * Add a subj= field and, if necessary, a AUDIT_MAC_TASK_CONTEXTS record. */ int audit_log_subj_ctx(struct audit_buffer *ab, struct lsm_prop *prop) { struct lsm_context ctx; char *space = ""; int error; int i; security_current_getlsmprop_subj(prop); if (!lsmprop_is_set(prop)) return 0; if (audit_subj_secctx_cnt < 2) { error = security_lsmprop_to_secctx(prop, &ctx, LSM_ID_UNDEF); if (error < 0) { if (error != -EINVAL) goto error_path; return 0; } audit_log_format(ab, " subj=%s", ctx.context); security_release_secctx(&ctx); return 0; } /* Multiple LSMs provide contexts. Include an aux record. */ audit_log_format(ab, " subj=?"); error = audit_buffer_aux_new(ab, AUDIT_MAC_TASK_CONTEXTS); if (error) goto error_path; for (i = 0; i < audit_subj_secctx_cnt; i++) { error = security_lsmprop_to_secctx(prop, &ctx, audit_subj_lsms[i]->id); if (error < 0) { /* * Don't print anything. An LSM like BPF could * claim to support contexts, but only do so under * certain conditions. */ if (error == -EOPNOTSUPP) continue; if (error != -EINVAL) audit_panic("error in audit_log_subj_ctx"); } else { audit_log_format(ab, "%ssubj_%s=%s", space, audit_subj_lsms[i]->name, ctx.context); space = " "; security_release_secctx(&ctx); } } audit_buffer_aux_end(ab); return 0; error_path: audit_panic("error in audit_log_subj_ctx"); return error; } EXPORT_SYMBOL(audit_log_subj_ctx); int audit_log_task_context(struct audit_buffer *ab) { struct lsm_prop prop; security_current_getlsmprop_subj(&prop); return audit_log_subj_ctx(ab, &prop); } EXPORT_SYMBOL(audit_log_task_context); int audit_log_obj_ctx(struct audit_buffer *ab, struct lsm_prop *prop) { int i; int rc; int error = 0; char *space = ""; struct lsm_context ctx; if (audit_obj_secctx_cnt < 2) { error = security_lsmprop_to_secctx(prop, &ctx, LSM_ID_UNDEF); if (error < 0) { if (error != -EINVAL) goto error_path; return error; } audit_log_format(ab, " obj=%s", ctx.context); security_release_secctx(&ctx); return 0; } audit_log_format(ab, " obj=?"); error = audit_buffer_aux_new(ab, AUDIT_MAC_OBJ_CONTEXTS); if (error) goto error_path; for (i = 0; i < audit_obj_secctx_cnt; i++) { rc = security_lsmprop_to_secctx(prop, &ctx, audit_obj_lsms[i]->id); if (rc < 0) { audit_log_format(ab, "%sobj_%s=?", space, audit_obj_lsms[i]->name); if (rc != -EINVAL) audit_panic("error in audit_log_obj_ctx"); error = rc; } else { audit_log_format(ab, "%sobj_%s=%s", space, audit_obj_lsms[i]->name, ctx.context); security_release_secctx(&ctx); } space = " "; } audit_buffer_aux_end(ab); return error; error_path: audit_panic("error in audit_log_obj_ctx"); return error; } void audit_log_d_path_exe(struct audit_buffer *ab, struct mm_struct *mm) { struct file *exe_file; if (!mm) goto out_null; exe_file = get_mm_exe_file(mm); if (!exe_file) goto out_null; audit_log_d_path(ab, " exe=", &exe_file->f_path); fput(exe_file); return; out_null: audit_log_format(ab, " exe=(null)"); } struct tty_struct *audit_get_tty(void) { struct tty_struct *tty = NULL; unsigned long flags; spin_lock_irqsave(¤t->sighand->siglock, flags); if (current->signal) tty = tty_kref_get(current->signal->tty); spin_unlock_irqrestore(¤t->sighand->siglock, flags); return tty; } void audit_put_tty(struct tty_struct *tty) { tty_kref_put(tty); } void audit_log_task_info(struct audit_buffer *ab) { const struct cred *cred; char comm[sizeof(current->comm)]; struct tty_struct *tty; if (!ab) return; cred = current_cred(); tty = audit_get_tty(); audit_log_format(ab, " ppid=%d pid=%d auid=%u uid=%u gid=%u" " euid=%u suid=%u fsuid=%u" " egid=%u sgid=%u fsgid=%u tty=%s ses=%u", task_ppid_nr(current), task_tgid_nr(current), from_kuid(&init_user_ns, audit_get_loginuid(current)), from_kuid(&init_user_ns, cred->uid), from_kgid(&init_user_ns, cred->gid), from_kuid(&init_user_ns, cred->euid), from_kuid(&init_user_ns, cred->suid), from_kuid(&init_user_ns, cred->fsuid), from_kgid(&init_user_ns, cred->egid), from_kgid(&init_user_ns, cred->sgid), from_kgid(&init_user_ns, cred->fsgid), tty ? tty_name(tty) : "(none)", audit_get_sessionid(current)); audit_put_tty(tty); audit_log_format(ab, " comm="); audit_log_untrustedstring(ab, get_task_comm(comm, current)); audit_log_d_path_exe(ab, current->mm); audit_log_task_context(ab); } EXPORT_SYMBOL(audit_log_task_info); /** * audit_log_path_denied - report a path restriction denial * @type: audit message type (AUDIT_ANOM_LINK, AUDIT_ANOM_CREAT, etc) * @operation: specific operation name */ void audit_log_path_denied(int type, const char *operation) { struct audit_buffer *ab; if (!audit_enabled) return; /* Generate log with subject, operation, outcome. */ ab = audit_log_start(audit_context(), GFP_KERNEL, type); if (!ab) return; audit_log_format(ab, "op=%s", operation); audit_log_task_info(ab); audit_log_format(ab, " res=0"); audit_log_end(ab); } /* global counter which is incremented every time something logs in */ static atomic_t session_id = ATOMIC_INIT(0); static int audit_set_loginuid_perm(kuid_t loginuid) { /* if we are unset, we don't need privs */ if (!audit_loginuid_set(current)) return 0; /* if AUDIT_FEATURE_LOGINUID_IMMUTABLE means never ever allow a change*/ if (is_audit_feature_set(AUDIT_FEATURE_LOGINUID_IMMUTABLE)) return -EPERM; /* it is set, you need permission */ if (!capable(CAP_AUDIT_CONTROL)) return -EPERM; /* reject if this is not an unset and we don't allow that */ if (is_audit_feature_set(AUDIT_FEATURE_ONLY_UNSET_LOGINUID) && uid_valid(loginuid)) return -EPERM; return 0; } static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, unsigned int oldsessionid, unsigned int sessionid, int rc) { struct audit_buffer *ab; uid_t uid, oldloginuid, loginuid; struct tty_struct *tty; if (!audit_enabled) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_LOGIN); if (!ab) return; uid = from_kuid(&init_user_ns, task_uid(current)); oldloginuid = from_kuid(&init_user_ns, koldloginuid); loginuid = from_kuid(&init_user_ns, kloginuid); tty = audit_get_tty(); audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid); audit_log_task_context(ab); audit_log_format(ab, " old-auid=%u auid=%u tty=%s old-ses=%u ses=%u res=%d", oldloginuid, loginuid, tty ? tty_name(tty) : "(none)", oldsessionid, sessionid, !rc); audit_put_tty(tty); audit_log_end(ab); } /** * audit_set_loginuid - set current task's loginuid * @loginuid: loginuid value * * Returns 0. * * Called (set) from fs/proc/base.c::proc_loginuid_write(). */ int audit_set_loginuid(kuid_t loginuid) { unsigned int oldsessionid, sessionid = AUDIT_SID_UNSET; kuid_t oldloginuid; int rc; oldloginuid = audit_get_loginuid(current); oldsessionid = audit_get_sessionid(current); rc = audit_set_loginuid_perm(loginuid); if (rc) goto out; /* are we setting or clearing? */ if (uid_valid(loginuid)) { sessionid = (unsigned int)atomic_inc_return(&session_id); if (unlikely(sessionid == AUDIT_SID_UNSET)) sessionid = (unsigned int)atomic_inc_return(&session_id); } current->sessionid = sessionid; current->loginuid = loginuid; out: audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc); return rc; } /** * audit_signal_info - record signal info for shutting down audit subsystem * @sig: signal value * @t: task being signaled * * If the audit subsystem is being terminated, record the task (pid) * and uid that is doing that. */ int audit_signal_info(int sig, struct task_struct *t) { kuid_t uid = current_uid(), auid; if (auditd_test_task(t) && (sig == SIGTERM || sig == SIGHUP || sig == SIGUSR1 || sig == SIGUSR2)) { audit_sig_pid = task_tgid_nr(current); auid = audit_get_loginuid(current); if (uid_valid(auid)) audit_sig_uid = auid; else audit_sig_uid = uid; security_current_getlsmprop_subj(&audit_sig_lsm); } return audit_signal_info_syscall(t); } /** * __audit_log_end - enqueue one audit record * @skb: the buffer to send */ static void __audit_log_end(struct sk_buff *skb) { struct nlmsghdr *nlh; if (audit_rate_check()) { /* setup the netlink header, see the comments in * kauditd_send_multicast_skb() for length quirks */ nlh = nlmsg_hdr(skb); nlh->nlmsg_len = skb->len - NLMSG_HDRLEN; /* queue the netlink packet */ skb_queue_tail(&audit_queue, skb); } else { audit_log_lost("rate limit exceeded"); kfree_skb(skb); } } /** * audit_log_end - end one audit record * @ab: the audit_buffer * * We can not do a netlink send inside an irq context because it blocks (last * arg, flags, is not set to MSG_DONTWAIT), so the audit buffer is placed on a * queue and a kthread is scheduled to remove them from the queue outside the * irq context. May be called in any context. */ void audit_log_end(struct audit_buffer *ab) { struct sk_buff *skb; if (!ab) return; while ((skb = skb_dequeue(&ab->skb_list))) __audit_log_end(skb); /* poke the kauditd thread */ wake_up_interruptible(&kauditd_wait); audit_buffer_free(ab); } /** * audit_log - Log an audit record * @ctx: audit context * @gfp_mask: type of allocation * @type: audit message type * @fmt: format string to use * @...: variable parameters matching the format string * * This is a convenience function that calls audit_log_start, * audit_log_vformat, and audit_log_end. It may be called * in any context. */ void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type, const char *fmt, ...) { struct audit_buffer *ab; va_list args; ab = audit_log_start(ctx, gfp_mask, type); if (ab) { va_start(args, fmt); audit_log_vformat(ab, fmt, args); va_end(args); audit_log_end(ab); } } EXPORT_SYMBOL(audit_log_start); EXPORT_SYMBOL(audit_log_end); EXPORT_SYMBOL(audit_log_format); EXPORT_SYMBOL(audit_log); |
| 63 63 63 63 63 63 63 63 67 67 67 67 20 20 20 20 63 63 63 63 63 67 66 67 67 67 67 67 67 67 66 67 1 67 67 67 67 67 67 67 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 | // SPDX-License-Identifier: GPL-2.0-only /* * AppArmor security module * * This file contains AppArmor policy attachment and domain transitions * * Copyright (C) 2002-2008 Novell/SUSE * Copyright 2009-2010 Canonical Ltd. */ #include <linux/errno.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/mount.h> #include <linux/syscalls.h> #include <linux/personality.h> #include <linux/xattr.h> #include <linux/user_namespace.h> #include "include/audit.h" #include "include/apparmorfs.h" #include "include/cred.h" #include "include/domain.h" #include "include/file.h" #include "include/ipc.h" #include "include/match.h" #include "include/path.h" #include "include/policy.h" #include "include/policy_ns.h" static const char * const CONFLICTING_ATTACH_STR = "conflicting profile attachments"; static const char * const CONFLICTING_ATTACH_STR_IX = "conflicting profile attachments - ix fallback"; static const char * const CONFLICTING_ATTACH_STR_UX = "conflicting profile attachments - ux fallback"; /** * may_change_ptraced_domain - check if can change profile on ptraced task * @to_cred: cred of task changing domain * @to_label: profile to change to (NOT NULL) * @info: message if there is an error * * Check if current is ptraced and if so if the tracing task is allowed * to trace the new domain * * Returns: %0 or error if change not allowed */ static int may_change_ptraced_domain(const struct cred *to_cred, struct aa_label *to_label, const char **info) { struct task_struct *tracer; struct aa_label *tracerl = NULL; const struct cred *tracer_cred = NULL; int error = 0; rcu_read_lock(); tracer = ptrace_parent(current); if (tracer) { /* released below */ tracerl = aa_get_task_label(tracer); tracer_cred = get_task_cred(tracer); } /* not ptraced */ if (!tracer || unconfined(tracerl)) goto out; error = aa_may_ptrace(tracer_cred, tracerl, to_cred, to_label, PTRACE_MODE_ATTACH); out: rcu_read_unlock(); aa_put_label(tracerl); put_cred(tracer_cred); if (error) *info = "ptrace prevents transition"; return error; } /**** TODO: dedup to aa_label_match - needs perm and dfa, merging * specifically this is an exact copy of aa_label_match except * aa_compute_perms is replaced with aa_compute_fperms * and policy->dfa with file->dfa ****/ /* match a profile and its associated ns component if needed * Assumes visibility test has already been done. * If a subns profile is not to be matched should be prescreened with * visibility test. */ static inline aa_state_t match_component(struct aa_profile *profile, struct aa_profile *tp, bool stack, aa_state_t state) { struct aa_ruleset *rules = profile->label.rules[0]; const char *ns_name; if (stack) state = aa_dfa_match(rules->file->dfa, state, "&"); if (profile->ns == tp->ns) return aa_dfa_match(rules->file->dfa, state, tp->base.hname); /* try matching with namespace name and then profile */ ns_name = aa_ns_name(profile->ns, tp->ns, true); state = aa_dfa_match_len(rules->file->dfa, state, ":", 1); state = aa_dfa_match(rules->file->dfa, state, ns_name); state = aa_dfa_match_len(rules->file->dfa, state, ":", 1); return aa_dfa_match(rules->file->dfa, state, tp->base.hname); } /** * label_compound_match - find perms for full compound label * @profile: profile to find perms for * @label: label to check access permissions for * @stack: whether this is a stacking request * @state: state to start match in * @subns: whether to do permission checks on components in a subns * @request: permissions to request * @perms: perms struct to set * * Returns: 0 on success else ERROR * * For the label A//&B//&C this does the perm match for A//&B//&C * @perms should be preinitialized with allperms OR a previous permission * check to be stacked. */ static int label_compound_match(struct aa_profile *profile, struct aa_label *label, bool stack, aa_state_t state, bool subns, u32 request, struct aa_perms *perms) { struct aa_ruleset *rules = profile->label.rules[0]; struct aa_profile *tp; struct label_it i; struct path_cond cond = { }; /* find first subcomponent that is visible */ label_for_each(i, label, tp) { if (!aa_ns_visible(profile->ns, tp->ns, subns)) continue; state = match_component(profile, tp, stack, state); if (!state) goto fail; goto next; } /* no component visible */ *perms = allperms; return 0; next: label_for_each_cont(i, label, tp) { if (!aa_ns_visible(profile->ns, tp->ns, subns)) continue; state = aa_dfa_match(rules->file->dfa, state, "//&"); state = match_component(profile, tp, false, state); if (!state) goto fail; } *perms = *(aa_lookup_condperms(current_fsuid(), rules->file, state, &cond)); aa_apply_modes_to_perms(profile, perms); if ((perms->allow & request) != request) return -EACCES; return 0; fail: *perms = nullperms; return -EACCES; } /** * label_components_match - find perms for all subcomponents of a label * @profile: profile to find perms for * @label: label to check access permissions for * @stack: whether this is a stacking request * @start: state to start match in * @subns: whether to do permission checks on components in a subns * @request: permissions to request * @perms: an initialized perms struct to add accumulation to * * Returns: 0 on success else ERROR * * For the label A//&B//&C this does the perm match for each of A and B and C * @perms should be preinitialized with allperms OR a previous permission * check to be stacked. */ static int label_components_match(struct aa_profile *profile, struct aa_label *label, bool stack, aa_state_t start, bool subns, u32 request, struct aa_perms *perms) { struct aa_ruleset *rules = profile->label.rules[0]; struct aa_profile *tp; struct label_it i; struct aa_perms tmp; struct path_cond cond = { }; aa_state_t state = 0; /* find first subcomponent to test */ label_for_each(i, label, tp) { if (!aa_ns_visible(profile->ns, tp->ns, subns)) continue; state = match_component(profile, tp, stack, start); if (!state) goto fail; goto next; } /* no subcomponents visible - no change in perms */ return 0; next: tmp = *(aa_lookup_condperms(current_fsuid(), rules->file, state, &cond)); aa_apply_modes_to_perms(profile, &tmp); aa_perms_accum(perms, &tmp); label_for_each_cont(i, label, tp) { if (!aa_ns_visible(profile->ns, tp->ns, subns)) continue; state = match_component(profile, tp, stack, start); if (!state) goto fail; tmp = *(aa_lookup_condperms(current_fsuid(), rules->file, state, &cond)); aa_apply_modes_to_perms(profile, &tmp); aa_perms_accum(perms, &tmp); } if ((perms->allow & request) != request) return -EACCES; return 0; fail: *perms = nullperms; return -EACCES; } /** * label_match - do a multi-component label match * @profile: profile to match against (NOT NULL) * @label: label to match (NOT NULL) * @stack: whether this is a stacking request * @state: state to start in * @subns: whether to match subns components * @request: permission request * @perms: Returns computed perms (NOT NULL) * * Returns: the state the match finished in, may be the none matching state */ static int label_match(struct aa_profile *profile, struct aa_label *label, bool stack, aa_state_t state, bool subns, u32 request, struct aa_perms *perms) { int error; *perms = nullperms; error = label_compound_match(profile, label, stack, state, subns, request, perms); if (!error) return error; *perms = allperms; return label_components_match(profile, label, stack, state, subns, request, perms); } /******* end TODO: dedup *****/ /** * change_profile_perms - find permissions for change_profile * @profile: the current profile (NOT NULL) * @target: label to transition to (NOT NULL) * @stack: whether this is a stacking request * @request: requested perms * @start: state to start matching in * @perms: Returns computed perms (NOT NULL) * * * Returns: permission set * * currently only matches full label A//&B//&C or individual components A, B, C * not arbitrary combinations. Eg. A//&B, C */ static int change_profile_perms(struct aa_profile *profile, struct aa_label *target, bool stack, u32 request, aa_state_t start, struct aa_perms *perms) { if (profile_unconfined(profile)) { perms->allow = AA_MAY_CHANGE_PROFILE | AA_MAY_ONEXEC; perms->audit = perms->quiet = perms->kill = 0; return 0; } /* TODO: add profile in ns screening */ return label_match(profile, target, stack, start, true, request, perms); } /** * aa_xattrs_match - check whether a file matches the xattrs defined in profile * @bprm: binprm struct for the process to validate * @profile: profile to match against (NOT NULL) * @state: state to start match in * * Returns: number of extended attributes that matched, or < 0 on error */ static int aa_xattrs_match(const struct linux_binprm *bprm, struct aa_profile *profile, aa_state_t state) { int i; struct dentry *d; char *value = NULL; struct aa_attachment *attach = &profile->attach; int size, value_size = 0, ret = attach->xattr_count; if (!bprm || !attach->xattr_count) return 0; might_sleep(); /* transition from exec match to xattr set */ state = aa_dfa_outofband_transition(attach->xmatch->dfa, state); d = bprm->file->f_path.dentry; for (i = 0; i < attach->xattr_count; i++) { size = vfs_getxattr_alloc(&nop_mnt_idmap, d, attach->xattrs[i], &value, value_size, GFP_KERNEL); if (size >= 0) { struct aa_perms *perms; /* * Check the xattr presence before value. This ensure * that not present xattr can be distinguished from a 0 * length value or rule that matches any value */ state = aa_dfa_null_transition(attach->xmatch->dfa, state); /* Check xattr value */ state = aa_dfa_match_len(attach->xmatch->dfa, state, value, size); perms = aa_lookup_perms(attach->xmatch, state); if (!(perms->allow & MAY_EXEC)) { ret = -EINVAL; goto out; } } /* transition to next element */ state = aa_dfa_outofband_transition(attach->xmatch->dfa, state); if (size < 0) { /* * No xattr match, so verify if transition to * next element was valid. IFF so the xattr * was optional. */ if (!state) { ret = -EINVAL; goto out; } /* don't count missing optional xattr as matched */ ret--; } } out: kfree(value); return ret; } /** * find_attach - do attachment search for unconfined processes * @bprm: binprm structure of transitioning task * @ns: the current namespace (NOT NULL) * @head: profile list to walk (NOT NULL) * @name: to match against (NOT NULL) * @info: info message if there was an error (NOT NULL) * * Do a linear search on the profiles in the list. There is a matching * preference where an exact match is preferred over a name which uses * expressions to match, and matching expressions with the greatest * xmatch_len are preferred. * * Requires: @head not be shared or have appropriate locks held * * Returns: label or NULL if no match found */ static struct aa_label *find_attach(const struct linux_binprm *bprm, struct aa_ns *ns, struct list_head *head, const char *name, const char **info) { int candidate_len = 0, candidate_xattrs = 0; bool conflict = false; struct aa_profile *profile, *candidate = NULL; AA_BUG(!name); AA_BUG(!head); rcu_read_lock(); restart: list_for_each_entry_rcu(profile, head, base.list) { struct aa_attachment *attach = &profile->attach; if (profile->label.flags & FLAG_NULL && &profile->label == ns_unconfined(profile->ns)) continue; /* Find the "best" matching profile. Profiles must * match the path and extended attributes (if any) * associated with the file. A more specific path * match will be preferred over a less specific one, * and a match with more matching extended attributes * will be preferred over one with fewer. If the best * match has both the same level of path specificity * and the same number of matching extended attributes * as another profile, signal a conflict and refuse to * match. */ if (attach->xmatch->dfa) { unsigned int count; aa_state_t state; struct aa_perms *perms; state = aa_dfa_leftmatch(attach->xmatch->dfa, attach->xmatch->start[AA_CLASS_XMATCH], name, &count); perms = aa_lookup_perms(attach->xmatch, state); /* any accepting state means a valid match. */ if (perms->allow & MAY_EXEC) { int ret = 0; if (count < candidate_len) continue; if (bprm && attach->xattr_count) { long rev = READ_ONCE(ns->revision); if (!aa_get_profile_not0(profile)) goto restart; rcu_read_unlock(); ret = aa_xattrs_match(bprm, profile, state); rcu_read_lock(); aa_put_profile(profile); if (rev != READ_ONCE(ns->revision)) /* policy changed */ goto restart; /* * Fail matching if the xattrs don't * match */ if (ret < 0) continue; } /* * TODO: allow for more flexible best match * * The new match isn't more specific * than the current best match */ if (count == candidate_len && ret <= candidate_xattrs) { /* Match is equivalent, so conflict */ if (ret == candidate_xattrs) conflict = true; continue; } /* Either the same length with more matching * xattrs, or a longer match */ candidate = profile; candidate_len = max(count, attach->xmatch_len); candidate_xattrs = ret; conflict = false; } } else if (!strcmp(profile->base.name, name)) { /* * old exact non-re match, without conditionals such * as xattrs. no more searching required */ candidate = profile; goto out; } } if (!candidate || conflict) { if (conflict) *info = CONFLICTING_ATTACH_STR; rcu_read_unlock(); return NULL; } out: candidate = aa_get_newest_profile(candidate); rcu_read_unlock(); return &candidate->label; } static const char *next_name(int xtype, const char *name) { return NULL; } /** * x_table_lookup - lookup an x transition name via transition table * @profile: current profile (NOT NULL) * @xindex: index into x transition table * @name: returns: name tested to find label (NOT NULL) * * Returns: refcounted label, or NULL on failure (MAYBE NULL) * @name will always be set with the last name tried */ struct aa_label *x_table_lookup(struct aa_profile *profile, u32 xindex, const char **name) { struct aa_ruleset *rules = profile->label.rules[0]; struct aa_label *label = NULL; u32 xtype = xindex & AA_X_TYPE_MASK; int index = xindex & AA_X_INDEX_MASK; const char *next; AA_BUG(!name); /* index is guaranteed to be in range, validated at load time */ /* TODO: move lookup parsing to unpack time so this is a straight * index into the resultant label */ for (next = rules->file->trans.table[index]; next; next = next_name(xtype, next)) { const char *lookup = (*next == '&') ? next + 1 : next; *name = next; if (xindex & AA_X_CHILD) { /* TODO: switich to parse to get stack of child */ struct aa_profile *new = aa_find_child(profile, lookup); if (new) /* release by caller */ return &new->label; continue; } label = aa_label_parse(&profile->label, lookup, GFP_KERNEL, true, false); if (!IS_ERR_OR_NULL(label)) /* release by caller */ return label; } return NULL; } /** * x_to_label - get target label for a given xindex * @profile: current profile (NOT NULL) * @bprm: binprm structure of transitioning task * @name: name to lookup (NOT NULL) * @xindex: index into x transition table * @lookupname: returns: name used in lookup if one was specified (NOT NULL) * @info: info message if there was an error (NOT NULL) * * find label for a transition index * * Returns: refcounted label or NULL if not found available */ static struct aa_label *x_to_label(struct aa_profile *profile, const struct linux_binprm *bprm, const char *name, u32 xindex, const char **lookupname, const char **info) { struct aa_label *new = NULL; struct aa_label *stack = NULL; struct aa_ns *ns = profile->ns; u32 xtype = xindex & AA_X_TYPE_MASK; /* Used for info checks during fallback handling */ const char *old_info = NULL; switch (xtype) { case AA_X_NONE: /* fail exec unless ix || ux fallback - handled by caller */ *lookupname = NULL; break; case AA_X_TABLE: /* TODO: fix when perm mapping done at unload */ /* released by caller * if null for both stack and direct want to try fallback */ new = x_table_lookup(profile, xindex, lookupname); if (!new || **lookupname != '&') break; stack = new; new = NULL; fallthrough; /* to X_NAME */ case AA_X_NAME: if (xindex & AA_X_CHILD) /* released by caller */ new = find_attach(bprm, ns, &profile->base.profiles, name, info); else /* released by caller */ new = find_attach(bprm, ns, &ns->base.profiles, name, info); *lookupname = name; break; } /* fallback transition check */ if (!new) { if (xindex & AA_X_INHERIT) { /* (p|c|n)ix - don't change profile but do * use the newest version */ if (*info == CONFLICTING_ATTACH_STR) { *info = CONFLICTING_ATTACH_STR_IX; } else { old_info = *info; *info = "ix fallback"; } /* no profile && no error */ new = aa_get_newest_label(&profile->label); } else if (xindex & AA_X_UNCONFINED) { new = aa_get_newest_label(ns_unconfined(profile->ns)); if (*info == CONFLICTING_ATTACH_STR) { *info = CONFLICTING_ATTACH_STR_UX; } else { old_info = *info; *info = "ux fallback"; } } /* We set old_info on the code paths above where overwriting * could have happened, so now check if info was set by * find_attach as well (i.e. whether we actually overwrote) * and warn accordingly. */ if (old_info && old_info != CONFLICTING_ATTACH_STR) { pr_warn_ratelimited( "AppArmor: find_attach (from profile %s) audit info \"%s\" dropped", profile->base.hname, old_info); } } if (new && stack) { /* base the stack on post domain transition */ struct aa_label *base = new; new = aa_label_merge(base, stack, GFP_KERNEL); /* null on error */ aa_put_label(base); } aa_put_label(stack); /* released by caller */ return new; } static struct aa_label *profile_transition(const struct cred *subj_cred, struct aa_profile *profile, const struct linux_binprm *bprm, char *buffer, struct path_cond *cond, bool *secure_exec) { struct aa_ruleset *rules = profile->label.rules[0]; struct aa_label *new = NULL; struct aa_profile *new_profile = NULL; const char *info = NULL, *name = NULL, *target = NULL; aa_state_t state = rules->file->start[AA_CLASS_FILE]; struct aa_perms perms = {}; bool nonewprivs = false; int error = 0; AA_BUG(!profile); AA_BUG(!bprm); AA_BUG(!buffer); error = aa_path_name(&bprm->file->f_path, profile->path_flags, buffer, &name, &info, profile->disconnected); if (error) { if (profile_unconfined(profile) || (profile->label.flags & FLAG_IX_ON_NAME_ERROR)) { AA_DEBUG(DEBUG_DOMAIN, "name lookup ix on error"); error = 0; new = aa_get_newest_label(&profile->label); } name = bprm->filename; goto audit; } if (profile_unconfined(profile)) { new = find_attach(bprm, profile->ns, &profile->ns->base.profiles, name, &info); /* info set -> something unusual that we should report * Currently this is only conflicting attachments, but other * infos added in the future should also be logged by default * and only excluded on a case-by-case basis */ if (info) { /* Because perms is never used again after this audit * we don't need to care about clobbering it */ perms.audit |= MAY_EXEC; perms.allow |= MAY_EXEC; /* Don't cause error if auditing fails */ (void) aa_audit_file(subj_cred, profile, &perms, OP_EXEC, MAY_EXEC, name, target, new, cond->uid, info, error); } if (new) { AA_DEBUG(DEBUG_DOMAIN, "unconfined attached to new label"); return new; } AA_DEBUG(DEBUG_DOMAIN, "unconfined exec no attachment"); return aa_get_newest_label(&profile->label); } /* find exec permissions for name */ state = aa_str_perms(rules->file, state, name, cond, &perms); if (perms.allow & MAY_EXEC) { /* exec permission determine how to transition */ new = x_to_label(profile, bprm, name, perms.xindex, &target, &info); if (new && new->proxy == profile->label.proxy && info) { /* Force audit on conflicting attachment fallback * Because perms is never used again after this audit * we don't need to care about clobbering it */ if (info == CONFLICTING_ATTACH_STR_IX || info == CONFLICTING_ATTACH_STR_UX) perms.audit |= MAY_EXEC; /* hack ix fallback - improve how this is detected */ goto audit; } else if (!new) { if (info) { pr_warn_ratelimited( "AppArmor: %s (from profile %s) audit info \"%s\" dropped on missing transition", __func__, profile->base.hname, info); } info = "profile transition not found"; /* remove MAY_EXEC to audit as failure or complaint */ perms.allow &= ~MAY_EXEC; if (COMPLAIN_MODE(profile)) { /* create null profile instead of failing */ goto create_learning_profile; } error = -EACCES; } } else if (COMPLAIN_MODE(profile)) { create_learning_profile: /* no exec permission - learning mode */ new_profile = aa_new_learning_profile(profile, false, name, GFP_KERNEL); if (!new_profile) { error = -ENOMEM; info = "could not create null profile"; } else { error = -EACCES; new = &new_profile->label; } perms.xindex |= AA_X_UNSAFE; } else /* fail exec */ error = -EACCES; if (!new) goto audit; if (!(perms.xindex & AA_X_UNSAFE)) { if (DEBUG_ON) { dbg_printk("apparmor: setting AT_SECURE for %s profile=", name); aa_label_printk(new, GFP_KERNEL); dbg_printk("\n"); } *secure_exec = true; } audit: aa_audit_file(subj_cred, profile, &perms, OP_EXEC, MAY_EXEC, name, target, new, cond->uid, info, error); if (!new || nonewprivs) { aa_put_label(new); return ERR_PTR(error); } return new; } static int profile_onexec(const struct cred *subj_cred, struct aa_profile *profile, struct aa_label *onexec, bool stack, const struct linux_binprm *bprm, char *buffer, struct path_cond *cond, bool *secure_exec) { struct aa_ruleset *rules = profile->label.rules[0]; aa_state_t state = rules->file->start[AA_CLASS_FILE]; struct aa_perms perms = {}; const char *xname = NULL, *info = "change_profile onexec"; int error = -EACCES; AA_BUG(!profile); AA_BUG(!onexec); AA_BUG(!bprm); AA_BUG(!buffer); if (profile_unconfined(profile)) { /* change_profile on exec already granted */ /* * NOTE: Domain transitions from unconfined are allowed * even when no_new_privs is set because this always results * in a further reduction of permissions. */ return 0; } error = aa_path_name(&bprm->file->f_path, profile->path_flags, buffer, &xname, &info, profile->disconnected); if (error) { if (profile_unconfined(profile) || (profile->label.flags & FLAG_IX_ON_NAME_ERROR)) { AA_DEBUG(DEBUG_DOMAIN, "name lookup ix on error"); error = 0; } xname = bprm->filename; goto audit; } /* find exec permissions for name */ state = aa_str_perms(rules->file, state, xname, cond, &perms); if (!(perms.allow & AA_MAY_ONEXEC)) { info = "no change_onexec valid for executable"; goto audit; } /* test if this exec can be paired with change_profile onexec. * onexec permission is linked to exec with a standard pairing * exec\0change_profile */ state = aa_dfa_null_transition(rules->file->dfa, state); error = change_profile_perms(profile, onexec, stack, AA_MAY_ONEXEC, state, &perms); if (error) { perms.allow &= ~AA_MAY_ONEXEC; goto audit; } if (!(perms.xindex & AA_X_UNSAFE)) { if (DEBUG_ON) { dbg_printk("apparmor: setting AT_SECURE for %s label=", xname); aa_label_printk(onexec, GFP_KERNEL); dbg_printk("\n"); } *secure_exec = true; } audit: return aa_audit_file(subj_cred, profile, &perms, OP_EXEC, AA_MAY_ONEXEC, xname, NULL, onexec, cond->uid, info, error); } /* ensure none ns domain transitions are correctly applied with onexec */ static struct aa_label *handle_onexec(const struct cred *subj_cred, struct aa_label *label, struct aa_label *onexec, bool stack, const struct linux_binprm *bprm, char *buffer, struct path_cond *cond, bool *unsafe) { struct aa_profile *profile; struct aa_label *new; int error; AA_BUG(!label); AA_BUG(!onexec); AA_BUG(!bprm); AA_BUG(!buffer); /* TODO: determine how much we want to loosen this */ error = fn_for_each_in_ns(label, profile, profile_onexec(subj_cred, profile, onexec, stack, bprm, buffer, cond, unsafe)); if (error) return ERR_PTR(error); new = fn_label_build_in_ns(label, profile, GFP_KERNEL, stack ? aa_label_merge(&profile->label, onexec, GFP_KERNEL) : aa_get_newest_label(onexec), profile_transition(subj_cred, profile, bprm, buffer, cond, unsafe)); if (new) return new; /* TODO: get rid of GLOBAL_ROOT_UID */ error = fn_for_each_in_ns(label, profile, aa_audit_file(subj_cred, profile, &nullperms, OP_CHANGE_ONEXEC, AA_MAY_ONEXEC, bprm->filename, NULL, onexec, GLOBAL_ROOT_UID, "failed to build target label", -ENOMEM)); return ERR_PTR(error); } /** * apparmor_bprm_creds_for_exec - Update the new creds on the bprm struct * @bprm: binprm for the exec (NOT NULL) * * Returns: %0 or error on failure * * TODO: once the other paths are done see if we can't refactor into a fn */ int apparmor_bprm_creds_for_exec(struct linux_binprm *bprm) { struct aa_task_ctx *ctx; struct aa_label *label, *new = NULL; const struct cred *subj_cred; struct aa_profile *profile; char *buffer = NULL; const char *info = NULL; int error = 0; bool unsafe = false; vfsuid_t vfsuid = i_uid_into_vfsuid(file_mnt_idmap(bprm->file), file_inode(bprm->file)); struct path_cond cond = { vfsuid_into_kuid(vfsuid), file_inode(bprm->file)->i_mode }; subj_cred = current_cred(); ctx = task_ctx(current); AA_BUG(!cred_label(bprm->cred)); AA_BUG(!ctx); label = aa_get_newest_label(cred_label(bprm->cred)); /* * Detect no new privs being set, and store the label it * occurred under. Ideally this would happen when nnp * is set but there isn't a good way to do that yet. * * Testing for unconfined must be done before the subset test */ if ((bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS) && !unconfined(label) && !ctx->nnp) ctx->nnp = aa_get_label(label); /* buffer freed below, name is pointer into buffer */ buffer = aa_get_buffer(false); if (!buffer) { error = -ENOMEM; goto done; } /* Test for onexec first as onexec override other x transitions. */ if (ctx->onexec) new = handle_onexec(subj_cred, label, ctx->onexec, ctx->token, bprm, buffer, &cond, &unsafe); else new = fn_label_build(label, profile, GFP_KERNEL, profile_transition(subj_cred, profile, bprm, buffer, &cond, &unsafe)); AA_BUG(!new); if (IS_ERR(new)) { error = PTR_ERR(new); goto done; } else if (!new) { error = -ENOMEM; goto done; } /* Policy has specified a domain transitions. If no_new_privs and * confined ensure the transition is to confinement that is subset * of the confinement when the task entered no new privs. * * NOTE: Domain transitions from unconfined and to stacked * subsets are allowed even when no_new_privs is set because this * always results in a further reduction of permissions. */ if ((bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS) && !unconfined(label) && !aa_label_is_unconfined_subset(new, ctx->nnp)) { error = -EPERM; info = "no new privs"; goto audit; } if (bprm->unsafe & LSM_UNSAFE_SHARE) { /* FIXME: currently don't mediate shared state */ ; } if (bprm->unsafe & (LSM_UNSAFE_PTRACE)) { /* TODO: test needs to be profile of label to new */ error = may_change_ptraced_domain(bprm->cred, new, &info); if (error) goto audit; } if (unsafe) { if (DEBUG_ON) { dbg_printk("setting AT_SECURE for %s label=", bprm->filename); aa_label_printk(new, GFP_KERNEL); dbg_printk("\n"); } bprm->secureexec = 1; } if (label->proxy != new->proxy) { /* when transitioning clear unsafe personality bits */ if (DEBUG_ON) { dbg_printk("apparmor: clearing unsafe personality bits. %s label=", bprm->filename); aa_label_printk(new, GFP_KERNEL); dbg_printk("\n"); } bprm->per_clear |= PER_CLEAR_ON_SETID; } aa_put_label(cred_label(bprm->cred)); /* transfer reference, released when cred is freed */ set_cred_label(bprm->cred, new); done: aa_put_label(label); aa_put_buffer(buffer); return error; audit: error = fn_for_each(label, profile, aa_audit_file(current_cred(), profile, &nullperms, OP_EXEC, MAY_EXEC, bprm->filename, NULL, new, vfsuid_into_kuid(vfsuid), info, error)); aa_put_label(new); goto done; } /* * Functions for self directed profile change */ /* helper fn for change_hat * * Returns: label for hat transition OR ERR_PTR. Does NOT return NULL */ static struct aa_label *build_change_hat(const struct cred *subj_cred, struct aa_profile *profile, const char *name, bool sibling) { struct aa_profile *root, *hat = NULL; const char *info = NULL; int error = 0; if (sibling && PROFILE_IS_HAT(profile)) { root = aa_get_profile_rcu(&profile->parent); } else if (!sibling && !PROFILE_IS_HAT(profile)) { root = aa_get_profile(profile); } else { info = "conflicting target types"; error = -EPERM; goto audit; } hat = aa_find_child(root, name); if (!hat) { error = -ENOENT; if (COMPLAIN_MODE(profile)) { hat = aa_new_learning_profile(profile, true, name, GFP_KERNEL); if (!hat) { info = "failed null profile create"; error = -ENOMEM; } } } aa_put_profile(root); audit: aa_audit_file(subj_cred, profile, &nullperms, OP_CHANGE_HAT, AA_MAY_CHANGEHAT, name, hat ? hat->base.hname : NULL, hat ? &hat->label : NULL, GLOBAL_ROOT_UID, info, error); if (!hat || (error && error != -ENOENT)) return ERR_PTR(error); /* if hat && error - complain mode, already audited and we adjust for * complain mode allow by returning hat->label */ return &hat->label; } /* helper fn for changing into a hat * * Returns: label for hat transition or ERR_PTR. Does not return NULL */ static struct aa_label *change_hat(const struct cred *subj_cred, struct aa_label *label, const char *hats[], int count, int flags) { struct aa_profile *profile, *root, *hat = NULL; struct aa_label *new; struct label_it it; bool sibling = false; const char *name, *info = NULL; int i, error; AA_BUG(!label); AA_BUG(!hats); AA_BUG(count < 1); if (PROFILE_IS_HAT(labels_profile(label))) sibling = true; /*find first matching hat */ for (i = 0; i < count && !hat; i++) { name = hats[i]; label_for_each_in_ns(it, labels_ns(label), label, profile) { if (sibling && PROFILE_IS_HAT(profile)) { root = aa_get_profile_rcu(&profile->parent); } else if (!sibling && !PROFILE_IS_HAT(profile)) { root = aa_get_profile(profile); } else { /* conflicting change type */ info = "conflicting targets types"; error = -EPERM; goto fail; } hat = aa_find_child(root, name); aa_put_profile(root); if (!hat) { if (!COMPLAIN_MODE(profile)) goto outer_continue; /* complain mode succeed as if hat */ } else if (!PROFILE_IS_HAT(hat)) { info = "target not hat"; error = -EPERM; aa_put_profile(hat); goto fail; } aa_put_profile(hat); } /* found a hat for all profiles in ns */ goto build; outer_continue: ; } /* no hats that match, find appropriate error * * In complain mode audit of the failure is based off of the first * hat supplied. This is done due how userspace interacts with * change_hat. */ name = NULL; label_for_each_in_ns(it, labels_ns(label), label, profile) { if (!list_empty(&profile->base.profiles)) { info = "hat not found"; error = -ENOENT; goto fail; } } info = "no hats defined"; error = -ECHILD; fail: label_for_each_in_ns(it, labels_ns(label), label, profile) { /* * no target as it has failed to be found or built * * change_hat uses probing and should not log failures * related to missing hats */ /* TODO: get rid of GLOBAL_ROOT_UID */ if (count > 1 || COMPLAIN_MODE(profile)) { aa_audit_file(subj_cred, profile, &nullperms, OP_CHANGE_HAT, AA_MAY_CHANGEHAT, name, NULL, NULL, GLOBAL_ROOT_UID, info, error); } } return ERR_PTR(error); build: new = fn_label_build_in_ns(label, profile, GFP_KERNEL, build_change_hat(subj_cred, profile, name, sibling), aa_get_label(&profile->label)); if (!new) { info = "label build failed"; error = -ENOMEM; goto fail; } /* else if (IS_ERR) build_change_hat has logged error so return new */ return new; } /** * aa_change_hat - change hat to/from subprofile * @hats: vector of hat names to try changing into (MAYBE NULL if @count == 0) * @count: number of hat names in @hats * @token: magic value to validate the hat change * @flags: flags affecting behavior of the change * * Returns %0 on success, error otherwise. * * Change to the first profile specified in @hats that exists, and store * the @hat_magic in the current task context. If the count == 0 and the * @token matches that stored in the current task context, return to the * top level profile. * * change_hat only applies to profiles in the current ns, and each profile * in the ns must make the same transition otherwise change_hat will fail. */ int aa_change_hat(const char *hats[], int count, u64 token, int flags) { const struct cred *subj_cred; struct aa_task_ctx *ctx = task_ctx(current); struct aa_label *label, *previous, *new = NULL, *target = NULL; struct aa_profile *profile; struct aa_perms perms = {}; const char *info = NULL; int error = 0; /* released below */ subj_cred = get_current_cred(); label = aa_get_newest_cred_label(subj_cred); previous = aa_get_newest_label(ctx->previous); /* * Detect no new privs being set, and store the label it * occurred under. Ideally this would happen when nnp * is set but there isn't a good way to do that yet. * * Testing for unconfined must be done before the subset test */ if (task_no_new_privs(current) && !unconfined(label) && !ctx->nnp) ctx->nnp = aa_get_label(label); /* return -EPERM when unconfined doesn't have children to avoid * changing the traditional error code for unconfined. */ if (unconfined(label)) { struct label_it i; bool empty = true; rcu_read_lock(); label_for_each_in_ns(i, labels_ns(label), label, profile) { empty &= list_empty(&profile->base.profiles); } rcu_read_unlock(); if (empty) { info = "unconfined can not change_hat"; error = -EPERM; goto fail; } } if (count) { new = change_hat(subj_cred, label, hats, count, flags); AA_BUG(!new); if (IS_ERR(new)) { error = PTR_ERR(new); new = NULL; /* already audited */ goto out; } /* target cred is the same as current except new label */ error = may_change_ptraced_domain(subj_cred, new, &info); if (error) goto fail; /* * no new privs prevents domain transitions that would * reduce restrictions. */ if (task_no_new_privs(current) && !unconfined(label) && !aa_label_is_unconfined_subset(new, ctx->nnp)) { /* not an apparmor denial per se, so don't log it */ AA_DEBUG(DEBUG_DOMAIN, "no_new_privs - change_hat denied"); error = -EPERM; goto out; } if (flags & AA_CHANGE_TEST) goto out; target = new; error = aa_set_current_hat(new, token); if (error == -EACCES) /* kill task in case of brute force attacks */ goto kill; } else if (previous && !(flags & AA_CHANGE_TEST)) { /* * no new privs prevents domain transitions that would * reduce restrictions. */ if (task_no_new_privs(current) && !unconfined(label) && !aa_label_is_unconfined_subset(previous, ctx->nnp)) { /* not an apparmor denial per se, so don't log it */ AA_DEBUG(DEBUG_DOMAIN, "no_new_privs - change_hat denied"); error = -EPERM; goto out; } /* Return to saved label. Kill task if restore fails * to avoid brute force attacks */ target = previous; error = aa_restore_previous_label(token); if (error) { if (error == -EACCES) goto kill; goto fail; } } /* else ignore @flags && restores when there is no saved profile */ out: aa_put_label(new); aa_put_label(previous); aa_put_label(label); put_cred(subj_cred); return error; kill: info = "failed token match"; perms.kill = AA_MAY_CHANGEHAT; fail: fn_for_each_in_ns(label, profile, aa_audit_file(subj_cred, profile, &perms, OP_CHANGE_HAT, AA_MAY_CHANGEHAT, NULL, NULL, target, GLOBAL_ROOT_UID, info, error)); goto out; } static int change_profile_perms_wrapper(const char *op, const char *name, const struct cred *subj_cred, struct aa_profile *profile, struct aa_label *target, bool stack, u32 request, struct aa_perms *perms) { struct aa_ruleset *rules = profile->label.rules[0]; const char *info = NULL; int error = 0; if (!error) error = change_profile_perms(profile, target, stack, request, rules->file->start[AA_CLASS_FILE], perms); if (error) error = aa_audit_file(subj_cred, profile, perms, op, request, name, NULL, target, GLOBAL_ROOT_UID, info, error); return error; } static const char *stack_msg = "change_profile unprivileged unconfined converted to stacking"; /** * aa_change_profile - perform a one-way profile transition * @fqname: name of profile may include namespace (NOT NULL) * @flags: flags affecting change behavior * * Change to new profile @name. Unlike with hats, there is no way * to change back. If @name isn't specified the current profile name is * used. * If @onexec then the transition is delayed until * the next exec. * * Returns %0 on success, error otherwise. */ int aa_change_profile(const char *fqname, int flags) { struct aa_label *label, *new = NULL, *target = NULL; struct aa_profile *profile; struct aa_perms perms = {}; const char *info = NULL; const char *auditname = fqname; /* retain leading & if stack */ bool stack = flags & AA_CHANGE_STACK; struct aa_task_ctx *ctx = task_ctx(current); const struct cred *subj_cred = get_current_cred(); int error = 0; char *op; u32 request; label = aa_get_current_label(); /* * Detect no new privs being set, and store the label it * occurred under. Ideally this would happen when nnp * is set but there isn't a good way to do that yet. * * Testing for unconfined must be done before the subset test */ if (task_no_new_privs(current) && !unconfined(label) && !ctx->nnp) ctx->nnp = aa_get_label(label); if (!fqname || !*fqname) { aa_put_label(label); AA_DEBUG(DEBUG_DOMAIN, "no profile name"); return -EINVAL; } if (flags & AA_CHANGE_ONEXEC) { request = AA_MAY_ONEXEC; if (stack) op = OP_STACK_ONEXEC; else op = OP_CHANGE_ONEXEC; } else { request = AA_MAY_CHANGE_PROFILE; if (stack) op = OP_STACK; else op = OP_CHANGE_PROFILE; } /* This should move to a per profile test. Requires pushing build * into callback */ if (!stack && unconfined(label) && label == &labels_ns(label)->unconfined->label && aa_unprivileged_unconfined_restricted && /* TODO: refactor so this check is a fn */ cap_capable(current_cred(), &init_user_ns, CAP_MAC_OVERRIDE, CAP_OPT_NOAUDIT)) { /* regardless of the request in this case apparmor * stacks against unconfined so admin set policy can't be * by-passed */ stack = true; perms.audit = request; (void) fn_for_each_in_ns(label, profile, aa_audit_file(subj_cred, profile, &perms, op, request, auditname, NULL, target, GLOBAL_ROOT_UID, stack_msg, 0)); perms.audit = 0; } if (*fqname == '&') { stack = true; /* don't have label_parse() do stacking */ fqname++; } target = aa_label_parse(label, fqname, GFP_KERNEL, true, false); if (IS_ERR(target)) { struct aa_profile *tprofile; info = "label not found"; error = PTR_ERR(target); target = NULL; /* * TODO: fixme using labels_profile is not right - do profile * per complain profile */ if ((flags & AA_CHANGE_TEST) || !COMPLAIN_MODE(labels_profile(label))) goto audit; /* released below */ tprofile = aa_new_learning_profile(labels_profile(label), false, fqname, GFP_KERNEL); if (!tprofile) { info = "failed null profile create"; error = -ENOMEM; goto audit; } target = &tprofile->label; goto check; } /* * self directed transitions only apply to current policy ns * TODO: currently requiring perms for stacking and straight change * stacking doesn't strictly need this. Determine how much * we want to loosen this restriction for stacking * * if (!stack) { */ error = fn_for_each_in_ns(label, profile, change_profile_perms_wrapper(op, auditname, subj_cred, profile, target, stack, request, &perms)); if (error) /* auditing done in change_profile_perms_wrapper */ goto out; /* } */ check: /* check if tracing task is allowed to trace target domain */ error = may_change_ptraced_domain(subj_cred, target, &info); if (error && !fn_for_each_in_ns(label, profile, COMPLAIN_MODE(profile))) goto audit; /* TODO: add permission check to allow this * if ((flags & AA_CHANGE_ONEXEC) && !current_is_single_threaded()) { * info = "not a single threaded task"; * error = -EACCES; * goto audit; * } */ if (flags & AA_CHANGE_TEST) goto out; /* stacking is always a subset, so only check the nonstack case */ if (!stack) { new = fn_label_build_in_ns(label, profile, GFP_KERNEL, aa_get_label(target), aa_get_label(&profile->label)); /* * no new privs prevents domain transitions that would * reduce restrictions. */ if (task_no_new_privs(current) && !unconfined(label) && !aa_label_is_unconfined_subset(new, ctx->nnp)) { /* not an apparmor denial per se, so don't log it */ AA_DEBUG(DEBUG_DOMAIN, "no_new_privs - change_hat denied"); error = -EPERM; goto out; } } if (!(flags & AA_CHANGE_ONEXEC)) { /* only transition profiles in the current ns */ if (stack) new = aa_label_merge(label, target, GFP_KERNEL); if (IS_ERR_OR_NULL(new)) { info = "failed to build target label"; if (!new) error = -ENOMEM; else error = PTR_ERR(new); new = NULL; perms.allow = 0; goto audit; } error = aa_replace_current_label(new); } else { if (new) { aa_put_label(new); new = NULL; } /* full transition will be built in exec path */ aa_set_current_onexec(target, stack); } audit: error = fn_for_each_in_ns(label, profile, aa_audit_file(subj_cred, profile, &perms, op, request, auditname, NULL, new ? new : target, GLOBAL_ROOT_UID, info, error)); out: aa_put_label(new); aa_put_label(target); aa_put_label(label); put_cred(subj_cred); return error; } |
| 10568 1786 10568 2 2 2 2 7363 7369 7377 7363 7418 9423 7069 9423 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * AppArmor security module * * This file contains AppArmor label definitions * * Copyright 2017 Canonical Ltd. */ #ifndef __AA_LABEL_H #define __AA_LABEL_H #include <linux/atomic.h> #include <linux/audit.h> #include <linux/rbtree.h> #include <linux/rcupdate.h> #include "apparmor.h" #include "lib.h" struct aa_ns; struct aa_ruleset; #define LOCAL_VEC_ENTRIES 8 #define DEFINE_VEC(T, V) \ struct aa_ ## T *(_ ## V ## _localtmp)[LOCAL_VEC_ENTRIES]; \ struct aa_ ## T **(V) #define vec_setup(T, V, N, GFP) \ ({ \ if ((N) <= LOCAL_VEC_ENTRIES) { \ typeof(N) i; \ (V) = (_ ## V ## _localtmp); \ for (i = 0; i < (N); i++) \ (V)[i] = NULL; \ } else \ (V) = kzalloc(sizeof(struct aa_ ## T *) * (N), (GFP)); \ (V) ? 0 : -ENOMEM; \ }) #define vec_cleanup(T, V, N) \ do { \ int i; \ for (i = 0; i < (N); i++) { \ if (!IS_ERR_OR_NULL((V)[i])) \ aa_put_ ## T((V)[i]); \ } \ if ((V) != _ ## V ## _localtmp) \ kfree(V); \ } while (0) #define vec_last(VEC, SIZE) ((VEC)[(SIZE) - 1]) #define vec_ns(VEC, SIZE) (vec_last((VEC), (SIZE))->ns) #define vec_labelset(VEC, SIZE) (&vec_ns((VEC), (SIZE))->labels) #define cleanup_domain_vec(V, L) cleanup_label_vec((V), (L)->size) struct aa_profile; #define VEC_FLAG_TERMINATE 1 int aa_vec_unique(struct aa_profile **vec, int n, int flags); struct aa_label *aa_vec_find_or_create_label(struct aa_profile **vec, int len, gfp_t gfp); #define aa_sort_and_merge_vec(N, V) \ aa_sort_and_merge_profiles((N), (struct aa_profile **)(V)) /* struct aa_labelset - set of labels for a namespace * * Labels are reference counted; aa_labelset does not contribute to label * reference counts. Once a label's last refcount is put it is removed from * the set. */ struct aa_labelset { rwlock_t lock; struct rb_root root; }; #define __labelset_for_each(LS, N) \ for ((N) = rb_first(&(LS)->root); (N); (N) = rb_next(N)) enum label_flags { FLAG_HAT = 1, /* profile is a hat */ FLAG_UNCONFINED = 2, /* label unconfined only if all */ FLAG_NULL = 4, /* profile is null learning profile */ FLAG_IX_ON_NAME_ERROR = 8, /* fallback to ix on name lookup fail */ FLAG_IMMUTIBLE = 0x10, /* don't allow changes/replacement */ FLAG_USER_DEFINED = 0x20, /* user based profile - lower privs */ FLAG_NO_LIST_REF = 0x40, /* list doesn't keep profile ref */ FLAG_NS_COUNT = 0x80, /* carries NS ref count */ FLAG_IN_TREE = 0x100, /* label is in tree */ FLAG_PROFILE = 0x200, /* label is a profile */ FLAG_EXPLICIT = 0x400, /* explicit static label */ FLAG_STALE = 0x800, /* replaced/removed */ FLAG_RENAMED = 0x1000, /* label has renaming in it */ FLAG_REVOKED = 0x2000, /* label has revocation in it */ FLAG_DEBUG1 = 0x4000, FLAG_DEBUG2 = 0x8000, /* These flags must correspond with PATH_flags */ /* TODO: add new path flags */ }; struct aa_label; struct aa_proxy { struct kref count; struct aa_label __rcu *label; }; struct label_it { int i, j; }; /* struct aa_label_base - base info of label * @count: ref count of active users * @node: rbtree position * @rcu: rcu callback struct * @proxy: is set to the label that replaced this label * @hname: text representation of the label (MAYBE_NULL) * @flags: stale and other flags - values may change under label set lock * @secid: secid that references this label * @size: number of entries in @ent[] * @mediates: bitmask for label_mediates * profile: label vec when embedded in a profile FLAG_PROFILE is set * rules: variable length rules in a profile FLAG_PROFILE is set * vec: vector of profiles comprising the compound label */ struct aa_label { struct kref count; struct rb_node node; struct rcu_head rcu; struct aa_proxy *proxy; __counted char *hname; long flags; u32 secid; int size; u64 mediates; union { struct { /* only used is the label is a profile, size of * rules[] is determined by the profile * profile[1] is poison or null as guard */ struct aa_profile *profile[2]; DECLARE_FLEX_ARRAY(struct aa_ruleset *, rules); }; DECLARE_FLEX_ARRAY(struct aa_profile *, vec); }; }; #define last_error(E, FN) \ do { \ int __subE = (FN); \ if (__subE) \ (E) = __subE; \ } while (0) #define label_isprofile(X) ((X)->flags & FLAG_PROFILE) #define label_unconfined(X) ((X)->flags & FLAG_UNCONFINED) #define unconfined(X) label_unconfined(X) #define label_is_stale(X) ((X)->flags & FLAG_STALE) #define __label_make_stale(X) ((X)->flags |= FLAG_STALE) #define labels_ns(X) (vec_ns(&((X)->vec[0]), (X)->size)) #define labels_set(X) (&labels_ns(X)->labels) #define labels_view(X) labels_ns(X) #define labels_profile(X) ((X)->vec[(X)->size - 1]) int aa_label_next_confined(struct aa_label *l, int i); /* for each profile in a label */ #define label_for_each(I, L, P) \ for ((I).i = 0; ((P) = (L)->vec[(I).i]); ++((I).i)) /* assumes break/goto ended label_for_each */ #define label_for_each_cont(I, L, P) \ for (++((I).i); ((P) = (L)->vec[(I).i]); ++((I).i)) /* for each profile that is enforcing confinement in a label */ #define label_for_each_confined(I, L, P) \ for ((I).i = aa_label_next_confined((L), 0); \ ((P) = (L)->vec[(I).i]); \ (I).i = aa_label_next_confined((L), (I).i + 1)) #define label_for_each_in_merge(I, A, B, P) \ for ((I).i = (I).j = 0; \ ((P) = aa_label_next_in_merge(&(I), (A), (B))); \ ) #define label_for_each_not_in_set(I, SET, SUB, P) \ for ((I).i = (I).j = 0; \ ((P) = __aa_label_next_not_in_set(&(I), (SET), (SUB))); \ ) #define next_in_ns(i, NS, L) \ ({ \ typeof(i) ___i = (i); \ while ((L)->vec[___i] && (L)->vec[___i]->ns != (NS)) \ (___i)++; \ (___i); \ }) #define label_for_each_in_ns(I, NS, L, P) \ for ((I).i = next_in_ns(0, (NS), (L)); \ ((P) = (L)->vec[(I).i]); \ (I).i = next_in_ns((I).i + 1, (NS), (L))) #define fn_for_each_in_ns(L, P, FN) \ ({ \ struct label_it __i; \ struct aa_ns *__ns = labels_ns(L); \ int __E = 0; \ label_for_each_in_ns(__i, __ns, (L), (P)) { \ last_error(__E, (FN)); \ } \ __E; \ }) #define fn_for_each_XXX(L, P, FN, ...) \ ({ \ struct label_it i; \ int __E = 0; \ label_for_each ## __VA_ARGS__(i, (L), (P)) { \ last_error(__E, (FN)); \ } \ __E; \ }) #define fn_for_each(L, P, FN) fn_for_each_XXX(L, P, FN) #define fn_for_each_confined(L, P, FN) fn_for_each_XXX(L, P, FN, _confined) #define fn_for_each2_XXX(L1, L2, P, FN, ...) \ ({ \ struct label_it i; \ int __E = 0; \ label_for_each ## __VA_ARGS__(i, (L1), (L2), (P)) { \ last_error(__E, (FN)); \ } \ __E; \ }) #define fn_for_each_in_merge(L1, L2, P, FN) \ fn_for_each2_XXX((L1), (L2), P, FN, _in_merge) #define fn_for_each_not_in_set(L1, L2, P, FN) \ fn_for_each2_XXX((L1), (L2), P, FN, _not_in_set) static inline bool label_mediates(struct aa_label *L, unsigned char C) { return (L)->mediates & (((u64) 1) << (C)); } static inline bool label_mediates_safe(struct aa_label *L, unsigned char C) { if (C > AA_CLASS_LAST) return false; return label_mediates(L, C); } void aa_labelset_destroy(struct aa_labelset *ls); void aa_labelset_init(struct aa_labelset *ls); void __aa_labelset_update_subtree(struct aa_ns *ns); void aa_label_destroy(struct aa_label *label); void aa_label_free(struct aa_label *label); void aa_label_kref(struct kref *kref); bool aa_label_init(struct aa_label *label, int size, gfp_t gfp); struct aa_label *aa_label_alloc(int size, struct aa_proxy *proxy, gfp_t gfp); bool aa_label_is_subset(struct aa_label *set, struct aa_label *sub); bool aa_label_is_unconfined_subset(struct aa_label *set, struct aa_label *sub); struct aa_profile *__aa_label_next_not_in_set(struct label_it *I, struct aa_label *set, struct aa_label *sub); bool aa_label_remove(struct aa_label *label); struct aa_label *aa_label_insert(struct aa_labelset *ls, struct aa_label *l); bool aa_label_replace(struct aa_label *old, struct aa_label *new); bool aa_label_make_newest(struct aa_labelset *ls, struct aa_label *old, struct aa_label *new); struct aa_profile *aa_label_next_in_merge(struct label_it *I, struct aa_label *a, struct aa_label *b); struct aa_label *aa_label_find_merge(struct aa_label *a, struct aa_label *b); struct aa_label *aa_label_merge(struct aa_label *a, struct aa_label *b, gfp_t gfp); bool aa_update_label_name(struct aa_ns *ns, struct aa_label *label, gfp_t gfp); #define FLAGS_NONE 0 #define FLAG_SHOW_MODE 1 #define FLAG_VIEW_SUBNS 2 #define FLAG_HIDDEN_UNCONFINED 4 #define FLAG_ABS_ROOT 8 int aa_label_snxprint(char *str, size_t size, struct aa_ns *view, struct aa_label *label, int flags); int aa_label_asxprint(char **strp, struct aa_ns *ns, struct aa_label *label, int flags, gfp_t gfp); int aa_label_acntsxprint(char __counted **strp, struct aa_ns *ns, struct aa_label *label, int flags, gfp_t gfp); void aa_label_xaudit(struct audit_buffer *ab, struct aa_ns *ns, struct aa_label *label, int flags, gfp_t gfp); void aa_label_seq_xprint(struct seq_file *f, struct aa_ns *ns, struct aa_label *label, int flags, gfp_t gfp); void aa_label_xprintk(struct aa_ns *ns, struct aa_label *label, int flags, gfp_t gfp); void aa_label_printk(struct aa_label *label, gfp_t gfp); struct aa_label *aa_label_strn_parse(struct aa_label *base, const char *str, size_t n, gfp_t gfp, bool create, bool force_stack); struct aa_label *aa_label_parse(struct aa_label *base, const char *str, gfp_t gfp, bool create, bool force_stack); static inline const char *aa_label_strn_split(const char *str, int n) { const char *pos; aa_state_t state; state = aa_dfa_matchn_until(stacksplitdfa, DFA_START, str, n, &pos); if (!ACCEPT_TABLE(stacksplitdfa)[state]) return NULL; return pos - 3; } static inline const char *aa_label_str_split(const char *str) { const char *pos; aa_state_t state; state = aa_dfa_match_until(stacksplitdfa, DFA_START, str, &pos); if (!ACCEPT_TABLE(stacksplitdfa)[state]) return NULL; return pos - 3; } struct aa_perms; struct aa_ruleset; int aa_label_match(struct aa_profile *profile, struct aa_ruleset *rules, struct aa_label *label, aa_state_t state, bool subns, u32 request, struct aa_perms *perms); /** * __aa_get_label - get a reference count to uncounted label reference * @l: reference to get a count on * * Returns: pointer to reference OR NULL if race is lost and reference is * being repeated. * Requires: lock held, and the return code MUST be checked */ static inline struct aa_label *__aa_get_label(struct aa_label *l) { if (l && kref_get_unless_zero(&l->count)) return l; return NULL; } static inline struct aa_label *aa_get_label(struct aa_label *l) { if (l) kref_get(&(l->count)); return l; } /** * aa_get_label_rcu - increment refcount on a label that can be replaced * @l: pointer to label that can be replaced (NOT NULL) * * Returns: pointer to a refcounted label. * else NULL if no label */ static inline struct aa_label *aa_get_label_rcu(struct aa_label __rcu **l) { struct aa_label *c; rcu_read_lock(); do { c = rcu_dereference(*l); } while (c && !kref_get_unless_zero(&c->count)); rcu_read_unlock(); return c; } /** * aa_get_newest_label - find the newest version of @l * @l: the label to check for newer versions of * * Returns: refcounted newest version of @l taking into account * replacement, renames and removals * return @l. */ static inline struct aa_label *aa_get_newest_label(struct aa_label *l) { if (!l) return NULL; if (label_is_stale(l)) { struct aa_label *tmp; AA_BUG(!l->proxy); AA_BUG(!l->proxy->label); /* BUG: only way this can happen is @l ref count and its * replacement count have gone to 0 and are on their way * to destruction. ie. we have a refcounting error */ tmp = aa_get_label_rcu(&l->proxy->label); AA_BUG(!tmp); return tmp; } return aa_get_label(l); } static inline void aa_put_label(struct aa_label *l) { if (l) kref_put(&l->count, aa_label_kref); } /* wrapper fn to indicate semantics of the check */ static inline bool __aa_subj_label_is_cached(struct aa_label *subj_label, struct aa_label *obj_label) { return aa_label_is_subset(obj_label, subj_label); } struct aa_proxy *aa_alloc_proxy(struct aa_label *l, gfp_t gfp); void aa_proxy_kref(struct kref *kref); static inline struct aa_proxy *aa_get_proxy(struct aa_proxy *proxy) { if (proxy) kref_get(&(proxy->count)); return proxy; } static inline void aa_put_proxy(struct aa_proxy *proxy) { if (proxy) kref_put(&proxy->count, aa_proxy_kref); } void __aa_proxy_redirect(struct aa_label *orig, struct aa_label *new); #endif /* __AA_LABEL_H */ |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 | /* SPDX-License-Identifier: GPL-2.0 */ /* Interface for implementing AF_XDP zero-copy support in drivers. * Copyright(c) 2020 Intel Corporation. */ #ifndef _LINUX_XDP_SOCK_DRV_H #define _LINUX_XDP_SOCK_DRV_H #include <net/xdp_sock.h> #include <net/xsk_buff_pool.h> #define XDP_UMEM_MIN_CHUNK_SHIFT 11 #define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT) struct xsk_cb_desc { void *src; u8 off; u8 bytes; }; #ifdef CONFIG_XDP_SOCKETS void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries); bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc); u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max); void xsk_tx_release(struct xsk_buff_pool *pool); struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev, u16 queue_id); void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool); void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool); void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool); void xsk_clear_tx_need_wakeup(struct xsk_buff_pool *pool); bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool); static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool) { return XDP_PACKET_HEADROOM + pool->headroom; } static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool) { return pool->chunk_size; } static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool) { return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool); } static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq) { xp_set_rxq_info(pool, rxq); } static inline void xsk_pool_fill_cb(struct xsk_buff_pool *pool, struct xsk_cb_desc *desc) { xp_fill_cb(pool, desc); } static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) { xp_dma_unmap(pool, attrs); } static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool, struct device *dev, unsigned long attrs) { struct xdp_umem *umem = pool->umem; return xp_dma_map(pool, dev, attrs, umem->pgs, umem->npgs); } static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp) { struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp); return xp_get_dma(xskb); } static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp) { struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp); return xp_get_frame_dma(xskb); } static inline struct xdp_buff *xsk_buff_alloc(struct xsk_buff_pool *pool) { return xp_alloc(pool); } static inline bool xsk_is_eop_desc(const struct xdp_desc *desc) { return !xp_mb_desc(desc); } /* Returns as many entries as possible up to max. 0 <= N <= max. */ static inline u32 xsk_buff_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u32 max) { return xp_alloc_batch(pool, xdp, max); } static inline bool xsk_buff_can_alloc(struct xsk_buff_pool *pool, u32 count) { return xp_can_alloc(pool, count); } static inline void xsk_buff_free(struct xdp_buff *xdp) { struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp); struct list_head *xskb_list = &xskb->pool->xskb_list; struct xdp_buff_xsk *pos, *tmp; if (likely(!xdp_buff_has_frags(xdp))) goto out; list_for_each_entry_safe(pos, tmp, xskb_list, list_node) { list_del(&pos->list_node); xp_free(pos); } xdp_get_shared_info_from_buff(xdp)->nr_frags = 0; out: xp_free(xskb); } static inline bool xsk_buff_add_frag(struct xdp_buff *head, struct xdp_buff *xdp) { const void *data = xdp->data; struct xdp_buff_xsk *frag; if (!__xdp_buff_add_frag(head, virt_to_netmem(data), offset_in_page(data), xdp->data_end - data, xdp->frame_sz, false)) return false; frag = container_of(xdp, struct xdp_buff_xsk, xdp); list_add_tail(&frag->list_node, &frag->pool->xskb_list); return true; } static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first) { struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp); struct xdp_buff *ret = NULL; struct xdp_buff_xsk *frag; frag = list_first_entry_or_null(&xskb->pool->xskb_list, struct xdp_buff_xsk, list_node); if (frag) { list_del(&frag->list_node); ret = &frag->xdp; } return ret; } static inline void xsk_buff_del_frag(struct xdp_buff *xdp) { struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp); list_del(&xskb->list_node); } static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first) { struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp); struct xdp_buff_xsk *frag; frag = list_first_entry(&xskb->pool->xskb_list, struct xdp_buff_xsk, list_node); return &frag->xdp; } static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first) { struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp); struct xdp_buff_xsk *frag; frag = list_last_entry(&xskb->pool->xskb_list, struct xdp_buff_xsk, list_node); return &frag->xdp; } static inline void xsk_buff_set_size(struct xdp_buff *xdp, u32 size) { xdp->data = xdp->data_hard_start + XDP_PACKET_HEADROOM; xdp->data_meta = xdp->data; xdp->data_end = xdp->data + size; xdp->flags = 0; } static inline dma_addr_t xsk_buff_raw_get_dma(struct xsk_buff_pool *pool, u64 addr) { return xp_raw_get_dma(pool, addr); } static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr) { return xp_raw_get_data(pool, addr); } /** * xsk_buff_raw_get_ctx - get &xdp_desc context * @pool: XSk buff pool desc address belongs to * @addr: desc address (from userspace) * * Wrapper for xp_raw_get_ctx() to be used in drivers, see its kdoc for * details. * * Return: new &xdp_desc_ctx struct containing desc's DMA address and metadata * pointer, if it is present and valid (initialized to %NULL otherwise). */ static inline struct xdp_desc_ctx xsk_buff_raw_get_ctx(const struct xsk_buff_pool *pool, u64 addr) { return xp_raw_get_ctx(pool, addr); } #define XDP_TXMD_FLAGS_VALID ( \ XDP_TXMD_FLAGS_TIMESTAMP | \ XDP_TXMD_FLAGS_CHECKSUM | \ XDP_TXMD_FLAGS_LAUNCH_TIME | \ 0) static inline bool xsk_buff_valid_tx_metadata(const struct xsk_tx_metadata *meta) { return !(meta->flags & ~XDP_TXMD_FLAGS_VALID); } static inline struct xsk_tx_metadata * __xsk_buff_get_metadata(const struct xsk_buff_pool *pool, void *data) { struct xsk_tx_metadata *meta; if (!pool->tx_metadata_len) return NULL; meta = data - pool->tx_metadata_len; if (unlikely(!xsk_buff_valid_tx_metadata(meta))) return NULL; /* no way to signal the error to the user */ return meta; } static inline struct xsk_tx_metadata * xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr) { return __xsk_buff_get_metadata(pool, xp_raw_get_data(pool, addr)); } static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp) { struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp); xp_dma_sync_for_cpu(xskb); } static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool, dma_addr_t dma, size_t size) { xp_dma_sync_for_device(pool, dma, size); } #else static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries) { } static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc) { return false; } static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max) { return 0; } static inline void xsk_tx_release(struct xsk_buff_pool *pool) { } static inline struct xsk_buff_pool * xsk_get_pool_from_qid(struct net_device *dev, u16 queue_id) { return NULL; } static inline void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) { } static inline void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool) { } static inline void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool) { } static inline void xsk_clear_tx_need_wakeup(struct xsk_buff_pool *pool) { } static inline bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool) { return false; } static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool) { return 0; } static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool) { return 0; } static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool) { return 0; } static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq) { } static inline void xsk_pool_fill_cb(struct xsk_buff_pool *pool, struct xsk_cb_desc *desc) { } static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) { } static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool, struct device *dev, unsigned long attrs) { return 0; } static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp) { return 0; } static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp) { return 0; } static inline struct xdp_buff *xsk_buff_alloc(struct xsk_buff_pool *pool) { return NULL; } static inline bool xsk_is_eop_desc(const struct xdp_desc *desc) { return false; } static inline u32 xsk_buff_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u32 max) { return 0; } static inline bool xsk_buff_can_alloc(struct xsk_buff_pool *pool, u32 count) { return false; } static inline void xsk_buff_free(struct xdp_buff *xdp) { } static inline bool xsk_buff_add_frag(struct xdp_buff *head, struct xdp_buff *xdp) { return false; } static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first) { return NULL; } static inline void xsk_buff_del_frag(struct xdp_buff *xdp) { } static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first) { return NULL; } static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first) { return NULL; } static inline void xsk_buff_set_size(struct xdp_buff *xdp, u32 size) { } static inline dma_addr_t xsk_buff_raw_get_dma(struct xsk_buff_pool *pool, u64 addr) { return 0; } static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr) { return NULL; } static inline struct xdp_desc_ctx xsk_buff_raw_get_ctx(const struct xsk_buff_pool *pool, u64 addr) { return (struct xdp_desc_ctx){ }; } static inline bool xsk_buff_valid_tx_metadata(struct xsk_tx_metadata *meta) { return false; } static inline struct xsk_tx_metadata * __xsk_buff_get_metadata(const struct xsk_buff_pool *pool, void *data) { return NULL; } static inline struct xsk_tx_metadata * xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr) { return NULL; } static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp) { } static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool, dma_addr_t dma, size_t size) { } #endif /* CONFIG_XDP_SOCKETS */ #endif /* _LINUX_XDP_SOCK_DRV_H */ |
| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 7 5 3 1 2 1 1 1 1 1 3 2 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 17 8 9 16 1 15 15 5 4 4 3 3 1 2 14 14 14 6 6 8 14 7 7 3 4 3 4 10 10 10 10 1 1 1 1 1 10 10 10 10 10 10 10 31 28 29 27 27 26 2 2 1 24 1 17 1 3 1 2 1 25 31 3 2 2 2 3 850 849 13 13 4 3 4 13 6 6 6 11 10 10 10 3 10 10 3 3 3 10 3 10 3 10 11 6 5 5 3 2 1 1 2 3 3 3 5 6 4 1 3 3 1 2 2 2 4 1 400 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 | // SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) /* * bcm.c - Broadcast Manager to filter/send (cyclic) CAN content * * Copyright (c) 2002-2017 Volkswagen Group Electronic Research * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of Volkswagen nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * Alternatively, provided that this notice is retained in full, this * software may be distributed under the terms of the GNU General * Public License ("GPL") version 2, in which case the provisions of the * GPL apply INSTEAD OF those given above. * * The provided data structures and external interfaces from this code * are not restricted to be used by modules with a GPL compatible license. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH * DAMAGE. * */ #include <linux/module.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/hrtimer.h> #include <linux/list.h> #include <linux/proc_fs.h> #include <linux/seq_file.h> #include <linux/uio.h> #include <linux/net.h> #include <linux/netdevice.h> #include <linux/socket.h> #include <linux/if_arp.h> #include <linux/skbuff.h> #include <linux/can.h> #include <linux/can/core.h> #include <linux/can/skb.h> #include <linux/can/bcm.h> #include <linux/slab.h> #include <linux/spinlock.h> #include <net/sock.h> #include <net/net_namespace.h> /* * To send multiple CAN frame content within TX_SETUP or to filter * CAN messages with multiplex index within RX_SETUP, the number of * different filters is limited to 256 due to the one byte index value. */ #define MAX_NFRAMES 256 /* limit timers to 400 days for sending/timeouts */ #define BCM_TIMER_SEC_MAX (400 * 24 * 60 * 60) /* use of last_frames[index].flags */ #define RX_LOCAL 0x10 /* frame was created on the local host */ #define RX_OWN 0x20 /* frame was sent via the socket it was received on */ #define RX_RECV 0x40 /* received data for this element */ #define RX_THR 0x80 /* element not been sent due to throttle feature */ #define BCM_CAN_FLAGS_MASK 0x0F /* to clean private flags after usage */ /* get best masking value for can_rx_register() for a given single can_id */ #define REGMASK(id) ((id & CAN_EFF_FLAG) ? \ (CAN_EFF_MASK | CAN_EFF_FLAG | CAN_RTR_FLAG) : \ (CAN_SFF_MASK | CAN_EFF_FLAG | CAN_RTR_FLAG)) MODULE_DESCRIPTION("PF_CAN broadcast manager protocol"); MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Oliver Hartkopp <oliver.hartkopp@volkswagen.de>"); MODULE_ALIAS("can-proto-2"); #define BCM_MIN_NAMELEN CAN_REQUIRED_SIZE(struct sockaddr_can, can_ifindex) /* * easy access to the first 64 bit of can(fd)_frame payload. cp->data is * 64 bit aligned so the offset has to be multiples of 8 which is ensured * by the only callers in bcm_rx_cmp_to_index() bcm_rx_handler(). */ static inline u64 get_u64(const struct canfd_frame *cp, int offset) { return *(u64 *)(cp->data + offset); } struct bcm_op { struct list_head list; struct rcu_head rcu; int ifindex; canid_t can_id; u32 flags; unsigned long frames_abs, frames_filtered; struct bcm_timeval ival1, ival2; struct hrtimer timer, thrtimer; ktime_t rx_stamp, kt_ival1, kt_ival2, kt_lastmsg; int rx_ifindex; int cfsiz; u32 count; u32 nframes; u32 currframe; /* void pointers to arrays of struct can[fd]_frame */ void *frames; void *last_frames; struct canfd_frame sframe; struct canfd_frame last_sframe; struct sock *sk; struct net_device *rx_reg_dev; spinlock_t bcm_tx_lock; /* protect currframe/count in runtime updates */ }; struct bcm_sock { struct sock sk; int bound; int ifindex; struct list_head notifier; struct list_head rx_ops; struct list_head tx_ops; unsigned long dropped_usr_msgs; struct proc_dir_entry *bcm_proc_read; char procname [32]; /* inode number in decimal with \0 */ }; static LIST_HEAD(bcm_notifier_list); static DEFINE_SPINLOCK(bcm_notifier_lock); static struct bcm_sock *bcm_busy_notifier; /* Return pointer to store the extra msg flags for bcm_recvmsg(). * We use the space of one unsigned int beyond the 'struct sockaddr_can' * in skb->cb. */ static inline unsigned int *bcm_flags(struct sk_buff *skb) { /* return pointer after struct sockaddr_can */ return (unsigned int *)(&((struct sockaddr_can *)skb->cb)[1]); } static inline struct bcm_sock *bcm_sk(const struct sock *sk) { return (struct bcm_sock *)sk; } static inline ktime_t bcm_timeval_to_ktime(struct bcm_timeval tv) { return ktime_set(tv.tv_sec, tv.tv_usec * NSEC_PER_USEC); } /* check limitations for timeval provided by user */ static bool bcm_is_invalid_tv(struct bcm_msg_head *msg_head) { if ((msg_head->ival1.tv_sec < 0) || (msg_head->ival1.tv_sec > BCM_TIMER_SEC_MAX) || (msg_head->ival1.tv_usec < 0) || (msg_head->ival1.tv_usec >= USEC_PER_SEC) || (msg_head->ival2.tv_sec < 0) || (msg_head->ival2.tv_sec > BCM_TIMER_SEC_MAX) || (msg_head->ival2.tv_usec < 0) || (msg_head->ival2.tv_usec >= USEC_PER_SEC)) return true; return false; } #define CFSIZ(flags) ((flags & CAN_FD_FRAME) ? CANFD_MTU : CAN_MTU) #define OPSIZ sizeof(struct bcm_op) #define MHSIZ sizeof(struct bcm_msg_head) /* * procfs functions */ #if IS_ENABLED(CONFIG_PROC_FS) static char *bcm_proc_getifname(struct net *net, char *result, int ifindex) { struct net_device *dev; if (!ifindex) return "any"; rcu_read_lock(); dev = dev_get_by_index_rcu(net, ifindex); if (dev) strcpy(result, dev->name); else strcpy(result, "???"); rcu_read_unlock(); return result; } static int bcm_proc_show(struct seq_file *m, void *v) { char ifname[IFNAMSIZ]; struct net *net = m->private; struct sock *sk = (struct sock *)pde_data(m->file->f_inode); struct bcm_sock *bo = bcm_sk(sk); struct bcm_op *op; seq_printf(m, ">>> socket %pK", sk->sk_socket); seq_printf(m, " / sk %pK", sk); seq_printf(m, " / bo %pK", bo); seq_printf(m, " / dropped %lu", bo->dropped_usr_msgs); seq_printf(m, " / bound %s", bcm_proc_getifname(net, ifname, bo->ifindex)); seq_printf(m, " <<<\n"); rcu_read_lock(); list_for_each_entry_rcu(op, &bo->rx_ops, list) { unsigned long reduction; /* print only active entries & prevent division by zero */ if (!op->frames_abs) continue; seq_printf(m, "rx_op: %03X %-5s ", op->can_id, bcm_proc_getifname(net, ifname, op->ifindex)); if (op->flags & CAN_FD_FRAME) seq_printf(m, "(%u)", op->nframes); else seq_printf(m, "[%u]", op->nframes); seq_printf(m, "%c ", (op->flags & RX_CHECK_DLC) ? 'd' : ' '); if (op->kt_ival1) seq_printf(m, "timeo=%lld ", (long long)ktime_to_us(op->kt_ival1)); if (op->kt_ival2) seq_printf(m, "thr=%lld ", (long long)ktime_to_us(op->kt_ival2)); seq_printf(m, "# recv %ld (%ld) => reduction: ", op->frames_filtered, op->frames_abs); reduction = 100 - (op->frames_filtered * 100) / op->frames_abs; seq_printf(m, "%s%ld%%\n", (reduction == 100) ? "near " : "", reduction); } list_for_each_entry(op, &bo->tx_ops, list) { seq_printf(m, "tx_op: %03X %s ", op->can_id, bcm_proc_getifname(net, ifname, op->ifindex)); if (op->flags & CAN_FD_FRAME) seq_printf(m, "(%u) ", op->nframes); else seq_printf(m, "[%u] ", op->nframes); if (op->kt_ival1) seq_printf(m, "t1=%lld ", (long long)ktime_to_us(op->kt_ival1)); if (op->kt_ival2) seq_printf(m, "t2=%lld ", (long long)ktime_to_us(op->kt_ival2)); seq_printf(m, "# sent %ld\n", op->frames_abs); } seq_putc(m, '\n'); rcu_read_unlock(); return 0; } #endif /* CONFIG_PROC_FS */ /* * bcm_can_tx - send the (next) CAN frame to the appropriate CAN interface * of the given bcm tx op */ static void bcm_can_tx(struct bcm_op *op) { struct sk_buff *skb; struct net_device *dev; struct canfd_frame *cf; int err; /* no target device? => exit */ if (!op->ifindex) return; /* read currframe under lock protection */ spin_lock_bh(&op->bcm_tx_lock); cf = op->frames + op->cfsiz * op->currframe; spin_unlock_bh(&op->bcm_tx_lock); dev = dev_get_by_index(sock_net(op->sk), op->ifindex); if (!dev) { /* RFC: should this bcm_op remove itself here? */ return; } skb = alloc_skb(op->cfsiz + sizeof(struct can_skb_priv), gfp_any()); if (!skb) goto out; can_skb_reserve(skb); can_skb_prv(skb)->ifindex = dev->ifindex; can_skb_prv(skb)->skbcnt = 0; skb_put_data(skb, cf, op->cfsiz); /* send with loopback */ skb->dev = dev; can_skb_set_owner(skb, op->sk); err = can_send(skb, 1); /* update currframe and count under lock protection */ spin_lock_bh(&op->bcm_tx_lock); if (!err) op->frames_abs++; op->currframe++; /* reached last frame? */ if (op->currframe >= op->nframes) op->currframe = 0; if (op->count > 0) op->count--; spin_unlock_bh(&op->bcm_tx_lock); out: dev_put(dev); } /* * bcm_send_to_user - send a BCM message to the userspace * (consisting of bcm_msg_head + x CAN frames) */ static void bcm_send_to_user(struct bcm_op *op, struct bcm_msg_head *head, struct canfd_frame *frames, int has_timestamp) { struct sk_buff *skb; struct canfd_frame *firstframe; struct sockaddr_can *addr; struct sock *sk = op->sk; unsigned int datalen = head->nframes * op->cfsiz; int err; unsigned int *pflags; enum skb_drop_reason reason; skb = alloc_skb(sizeof(*head) + datalen, gfp_any()); if (!skb) return; skb_put_data(skb, head, sizeof(*head)); /* ensure space for sockaddr_can and msg flags */ sock_skb_cb_check_size(sizeof(struct sockaddr_can) + sizeof(unsigned int)); /* initialize msg flags */ pflags = bcm_flags(skb); *pflags = 0; if (head->nframes) { /* CAN frames starting here */ firstframe = (struct canfd_frame *)skb_tail_pointer(skb); skb_put_data(skb, frames, datalen); /* * the BCM uses the flags-element of the canfd_frame * structure for internal purposes. This is only * relevant for updates that are generated by the * BCM, where nframes is 1 */ if (head->nframes == 1) { if (firstframe->flags & RX_LOCAL) *pflags |= MSG_DONTROUTE; if (firstframe->flags & RX_OWN) *pflags |= MSG_CONFIRM; firstframe->flags &= BCM_CAN_FLAGS_MASK; } } if (has_timestamp) { /* restore rx timestamp */ skb->tstamp = op->rx_stamp; } /* * Put the datagram to the queue so that bcm_recvmsg() can * get it from there. We need to pass the interface index to * bcm_recvmsg(). We pass a whole struct sockaddr_can in skb->cb * containing the interface index. */ addr = (struct sockaddr_can *)skb->cb; memset(addr, 0, sizeof(*addr)); addr->can_family = AF_CAN; addr->can_ifindex = op->rx_ifindex; err = sock_queue_rcv_skb_reason(sk, skb, &reason); if (err < 0) { struct bcm_sock *bo = bcm_sk(sk); sk_skb_reason_drop(sk, skb, reason); /* don't care about overflows in this statistic */ bo->dropped_usr_msgs++; } } static bool bcm_tx_set_expiry(struct bcm_op *op, struct hrtimer *hrt) { ktime_t ival; if (op->kt_ival1 && op->count) ival = op->kt_ival1; else if (op->kt_ival2) ival = op->kt_ival2; else return false; hrtimer_set_expires(hrt, ktime_add(ktime_get(), ival)); return true; } static void bcm_tx_start_timer(struct bcm_op *op) { if (bcm_tx_set_expiry(op, &op->timer)) hrtimer_start_expires(&op->timer, HRTIMER_MODE_ABS_SOFT); } /* bcm_tx_timeout_handler - performs cyclic CAN frame transmissions */ static enum hrtimer_restart bcm_tx_timeout_handler(struct hrtimer *hrtimer) { struct bcm_op *op = container_of(hrtimer, struct bcm_op, timer); struct bcm_msg_head msg_head; if (op->kt_ival1 && (op->count > 0)) { bcm_can_tx(op); if (!op->count && (op->flags & TX_COUNTEVT)) { /* create notification to user */ memset(&msg_head, 0, sizeof(msg_head)); msg_head.opcode = TX_EXPIRED; msg_head.flags = op->flags; msg_head.count = op->count; msg_head.ival1 = op->ival1; msg_head.ival2 = op->ival2; msg_head.can_id = op->can_id; msg_head.nframes = 0; bcm_send_to_user(op, &msg_head, NULL, 0); } } else if (op->kt_ival2) { bcm_can_tx(op); } return bcm_tx_set_expiry(op, &op->timer) ? HRTIMER_RESTART : HRTIMER_NORESTART; } /* * bcm_rx_changed - create a RX_CHANGED notification due to changed content */ static void bcm_rx_changed(struct bcm_op *op, struct canfd_frame *data) { struct bcm_msg_head head; /* update statistics */ op->frames_filtered++; /* prevent statistics overflow */ if (op->frames_filtered > ULONG_MAX/100) op->frames_filtered = op->frames_abs = 0; /* this element is not throttled anymore */ data->flags &= ~RX_THR; memset(&head, 0, sizeof(head)); head.opcode = RX_CHANGED; head.flags = op->flags; head.count = op->count; head.ival1 = op->ival1; head.ival2 = op->ival2; head.can_id = op->can_id; head.nframes = 1; bcm_send_to_user(op, &head, data, 1); } /* * bcm_rx_update_and_send - process a detected relevant receive content change * 1. update the last received data * 2. send a notification to the user (if possible) */ static void bcm_rx_update_and_send(struct bcm_op *op, struct canfd_frame *lastdata, const struct canfd_frame *rxdata, unsigned char traffic_flags) { memcpy(lastdata, rxdata, op->cfsiz); /* mark as used and throttled by default */ lastdata->flags |= (RX_RECV|RX_THR); /* add own/local/remote traffic flags */ lastdata->flags |= traffic_flags; /* throttling mode inactive ? */ if (!op->kt_ival2) { /* send RX_CHANGED to the user immediately */ bcm_rx_changed(op, lastdata); return; } /* with active throttling timer we are just done here */ if (hrtimer_active(&op->thrtimer)) return; /* first reception with enabled throttling mode */ if (!op->kt_lastmsg) goto rx_changed_settime; /* got a second frame inside a potential throttle period? */ if (ktime_us_delta(ktime_get(), op->kt_lastmsg) < ktime_to_us(op->kt_ival2)) { /* do not send the saved data - only start throttle timer */ hrtimer_start(&op->thrtimer, ktime_add(op->kt_lastmsg, op->kt_ival2), HRTIMER_MODE_ABS_SOFT); return; } /* the gap was that big, that throttling was not needed here */ rx_changed_settime: bcm_rx_changed(op, lastdata); op->kt_lastmsg = ktime_get(); } /* * bcm_rx_cmp_to_index - (bit)compares the currently received data to formerly * received data stored in op->last_frames[] */ static void bcm_rx_cmp_to_index(struct bcm_op *op, unsigned int index, const struct canfd_frame *rxdata, unsigned char traffic_flags) { struct canfd_frame *cf = op->frames + op->cfsiz * index; struct canfd_frame *lcf = op->last_frames + op->cfsiz * index; int i; /* * no one uses the MSBs of flags for comparison, * so we use it here to detect the first time of reception */ if (!(lcf->flags & RX_RECV)) { /* received data for the first time => send update to user */ bcm_rx_update_and_send(op, lcf, rxdata, traffic_flags); return; } /* do a real check in CAN frame data section */ for (i = 0; i < rxdata->len; i += 8) { if ((get_u64(cf, i) & get_u64(rxdata, i)) != (get_u64(cf, i) & get_u64(lcf, i))) { bcm_rx_update_and_send(op, lcf, rxdata, traffic_flags); return; } } if (op->flags & RX_CHECK_DLC) { /* do a real check in CAN frame length */ if (rxdata->len != lcf->len) { bcm_rx_update_and_send(op, lcf, rxdata, traffic_flags); return; } } } /* * bcm_rx_starttimer - enable timeout monitoring for CAN frame reception */ static void bcm_rx_starttimer(struct bcm_op *op) { if (op->flags & RX_NO_AUTOTIMER) return; if (op->kt_ival1) hrtimer_start(&op->timer, op->kt_ival1, HRTIMER_MODE_REL_SOFT); } /* bcm_rx_timeout_handler - when the (cyclic) CAN frame reception timed out */ static enum hrtimer_restart bcm_rx_timeout_handler(struct hrtimer *hrtimer) { struct bcm_op *op = container_of(hrtimer, struct bcm_op, timer); struct bcm_msg_head msg_head; /* if user wants to be informed, when cyclic CAN-Messages come back */ if ((op->flags & RX_ANNOUNCE_RESUME) && op->last_frames) { /* clear received CAN frames to indicate 'nothing received' */ memset(op->last_frames, 0, op->nframes * op->cfsiz); } /* create notification to user */ memset(&msg_head, 0, sizeof(msg_head)); msg_head.opcode = RX_TIMEOUT; msg_head.flags = op->flags; msg_head.count = op->count; msg_head.ival1 = op->ival1; msg_head.ival2 = op->ival2; msg_head.can_id = op->can_id; msg_head.nframes = 0; bcm_send_to_user(op, &msg_head, NULL, 0); return HRTIMER_NORESTART; } /* * bcm_rx_do_flush - helper for bcm_rx_thr_flush */ static inline int bcm_rx_do_flush(struct bcm_op *op, unsigned int index) { struct canfd_frame *lcf = op->last_frames + op->cfsiz * index; if ((op->last_frames) && (lcf->flags & RX_THR)) { bcm_rx_changed(op, lcf); return 1; } return 0; } /* * bcm_rx_thr_flush - Check for throttled data and send it to the userspace */ static int bcm_rx_thr_flush(struct bcm_op *op) { int updated = 0; if (op->nframes > 1) { unsigned int i; /* for MUX filter we start at index 1 */ for (i = 1; i < op->nframes; i++) updated += bcm_rx_do_flush(op, i); } else { /* for RX_FILTER_ID and simple filter */ updated += bcm_rx_do_flush(op, 0); } return updated; } /* * bcm_rx_thr_handler - the time for blocked content updates is over now: * Check for throttled data and send it to the userspace */ static enum hrtimer_restart bcm_rx_thr_handler(struct hrtimer *hrtimer) { struct bcm_op *op = container_of(hrtimer, struct bcm_op, thrtimer); if (bcm_rx_thr_flush(op)) { hrtimer_forward_now(hrtimer, op->kt_ival2); return HRTIMER_RESTART; } else { /* rearm throttle handling */ op->kt_lastmsg = 0; return HRTIMER_NORESTART; } } /* * bcm_rx_handler - handle a CAN frame reception */ static void bcm_rx_handler(struct sk_buff *skb, void *data) { struct bcm_op *op = (struct bcm_op *)data; const struct canfd_frame *rxframe = (struct canfd_frame *)skb->data; unsigned int i; unsigned char traffic_flags; if (op->can_id != rxframe->can_id) return; /* make sure to handle the correct frame type (CAN / CAN FD) */ if (op->flags & CAN_FD_FRAME) { if (!can_is_canfd_skb(skb)) return; } else { if (!can_is_can_skb(skb)) return; } /* disable timeout */ hrtimer_cancel(&op->timer); /* save rx timestamp */ op->rx_stamp = skb->tstamp; /* save originator for recvfrom() */ op->rx_ifindex = skb->dev->ifindex; /* update statistics */ op->frames_abs++; if (op->flags & RX_RTR_FRAME) { /* send reply for RTR-request (placed in op->frames[0]) */ bcm_can_tx(op); return; } /* compute flags to distinguish between own/local/remote CAN traffic */ traffic_flags = 0; if (skb->sk) { traffic_flags |= RX_LOCAL; if (skb->sk == op->sk) traffic_flags |= RX_OWN; } if (op->flags & RX_FILTER_ID) { /* the easiest case */ bcm_rx_update_and_send(op, op->last_frames, rxframe, traffic_flags); goto rx_starttimer; } if (op->nframes == 1) { /* simple compare with index 0 */ bcm_rx_cmp_to_index(op, 0, rxframe, traffic_flags); goto rx_starttimer; } if (op->nframes > 1) { /* * multiplex compare * * find the first multiplex mask that fits. * Remark: The MUX-mask is stored in index 0 - but only the * first 64 bits of the frame data[] are relevant (CAN FD) */ for (i = 1; i < op->nframes; i++) { if ((get_u64(op->frames, 0) & get_u64(rxframe, 0)) == (get_u64(op->frames, 0) & get_u64(op->frames + op->cfsiz * i, 0))) { bcm_rx_cmp_to_index(op, i, rxframe, traffic_flags); break; } } } rx_starttimer: bcm_rx_starttimer(op); } /* * helpers for bcm_op handling: find & delete bcm [rx|tx] op elements */ static struct bcm_op *bcm_find_op(struct list_head *ops, struct bcm_msg_head *mh, int ifindex) { struct bcm_op *op; list_for_each_entry(op, ops, list) { if ((op->can_id == mh->can_id) && (op->ifindex == ifindex) && (op->flags & CAN_FD_FRAME) == (mh->flags & CAN_FD_FRAME)) return op; } return NULL; } static void bcm_free_op_rcu(struct rcu_head *rcu_head) { struct bcm_op *op = container_of(rcu_head, struct bcm_op, rcu); if ((op->frames) && (op->frames != &op->sframe)) kfree(op->frames); if ((op->last_frames) && (op->last_frames != &op->last_sframe)) kfree(op->last_frames); kfree(op); } static void bcm_remove_op(struct bcm_op *op) { hrtimer_cancel(&op->timer); hrtimer_cancel(&op->thrtimer); call_rcu(&op->rcu, bcm_free_op_rcu); } static void bcm_rx_unreg(struct net_device *dev, struct bcm_op *op) { if (op->rx_reg_dev == dev) { can_rx_unregister(dev_net(dev), dev, op->can_id, REGMASK(op->can_id), bcm_rx_handler, op); /* mark as removed subscription */ op->rx_reg_dev = NULL; } else printk(KERN_ERR "can-bcm: bcm_rx_unreg: registered device " "mismatch %p %p\n", op->rx_reg_dev, dev); } /* * bcm_delete_rx_op - find and remove a rx op (returns number of removed ops) */ static int bcm_delete_rx_op(struct list_head *ops, struct bcm_msg_head *mh, int ifindex) { struct bcm_op *op, *n; list_for_each_entry_safe(op, n, ops, list) { if ((op->can_id == mh->can_id) && (op->ifindex == ifindex) && (op->flags & CAN_FD_FRAME) == (mh->flags & CAN_FD_FRAME)) { /* disable automatic timer on frame reception */ op->flags |= RX_NO_AUTOTIMER; /* * Don't care if we're bound or not (due to netdev * problems) can_rx_unregister() is always a save * thing to do here. */ if (op->ifindex) { /* * Only remove subscriptions that had not * been removed due to NETDEV_UNREGISTER * in bcm_notifier() */ if (op->rx_reg_dev) { struct net_device *dev; dev = dev_get_by_index(sock_net(op->sk), op->ifindex); if (dev) { bcm_rx_unreg(dev, op); dev_put(dev); } } } else can_rx_unregister(sock_net(op->sk), NULL, op->can_id, REGMASK(op->can_id), bcm_rx_handler, op); list_del_rcu(&op->list); bcm_remove_op(op); return 1; /* done */ } } return 0; /* not found */ } /* * bcm_delete_tx_op - find and remove a tx op (returns number of removed ops) */ static int bcm_delete_tx_op(struct list_head *ops, struct bcm_msg_head *mh, int ifindex) { struct bcm_op *op, *n; list_for_each_entry_safe(op, n, ops, list) { if ((op->can_id == mh->can_id) && (op->ifindex == ifindex) && (op->flags & CAN_FD_FRAME) == (mh->flags & CAN_FD_FRAME)) { list_del_rcu(&op->list); bcm_remove_op(op); return 1; /* done */ } } return 0; /* not found */ } /* * bcm_read_op - read out a bcm_op and send it to the user (for bcm_sendmsg) */ static int bcm_read_op(struct list_head *ops, struct bcm_msg_head *msg_head, int ifindex) { struct bcm_op *op = bcm_find_op(ops, msg_head, ifindex); if (!op) return -EINVAL; /* put current values into msg_head */ msg_head->flags = op->flags; msg_head->count = op->count; msg_head->ival1 = op->ival1; msg_head->ival2 = op->ival2; msg_head->nframes = op->nframes; bcm_send_to_user(op, msg_head, op->frames, 0); return MHSIZ; } /* * bcm_tx_setup - create or update a bcm tx op (for bcm_sendmsg) */ static int bcm_tx_setup(struct bcm_msg_head *msg_head, struct msghdr *msg, int ifindex, struct sock *sk) { struct bcm_sock *bo = bcm_sk(sk); struct bcm_op *op; struct canfd_frame *cf; unsigned int i; int err; /* we need a real device to send frames */ if (!ifindex) return -ENODEV; /* check nframes boundaries - we need at least one CAN frame */ if (msg_head->nframes < 1 || msg_head->nframes > MAX_NFRAMES) return -EINVAL; /* check timeval limitations */ if ((msg_head->flags & SETTIMER) && bcm_is_invalid_tv(msg_head)) return -EINVAL; /* check the given can_id */ op = bcm_find_op(&bo->tx_ops, msg_head, ifindex); if (op) { /* update existing BCM operation */ /* * Do we need more space for the CAN frames than currently * allocated? -> This is a _really_ unusual use-case and * therefore (complexity / locking) it is not supported. */ if (msg_head->nframes > op->nframes) return -E2BIG; /* update CAN frames content */ for (i = 0; i < msg_head->nframes; i++) { cf = op->frames + op->cfsiz * i; err = memcpy_from_msg((u8 *)cf, msg, op->cfsiz); if (op->flags & CAN_FD_FRAME) { if (cf->len > 64) err = -EINVAL; } else { if (cf->len > 8) err = -EINVAL; } if (err < 0) return err; if (msg_head->flags & TX_CP_CAN_ID) { /* copy can_id into frame */ cf->can_id = msg_head->can_id; } } op->flags = msg_head->flags; /* only lock for unlikely count/nframes/currframe changes */ if (op->nframes != msg_head->nframes || op->flags & TX_RESET_MULTI_IDX || op->flags & SETTIMER) { spin_lock_bh(&op->bcm_tx_lock); if (op->nframes != msg_head->nframes || op->flags & TX_RESET_MULTI_IDX) { /* potentially update changed nframes */ op->nframes = msg_head->nframes; /* restart multiple frame transmission */ op->currframe = 0; } if (op->flags & SETTIMER) op->count = msg_head->count; spin_unlock_bh(&op->bcm_tx_lock); } } else { /* insert new BCM operation for the given can_id */ op = kzalloc(OPSIZ, GFP_KERNEL); if (!op) return -ENOMEM; spin_lock_init(&op->bcm_tx_lock); op->can_id = msg_head->can_id; op->cfsiz = CFSIZ(msg_head->flags); op->flags = msg_head->flags; op->nframes = msg_head->nframes; if (op->flags & SETTIMER) op->count = msg_head->count; /* create array for CAN frames and copy the data */ if (msg_head->nframes > 1) { op->frames = kmalloc_array(msg_head->nframes, op->cfsiz, GFP_KERNEL); if (!op->frames) { kfree(op); return -ENOMEM; } } else op->frames = &op->sframe; for (i = 0; i < msg_head->nframes; i++) { cf = op->frames + op->cfsiz * i; err = memcpy_from_msg((u8 *)cf, msg, op->cfsiz); if (err < 0) goto free_op; if (op->flags & CAN_FD_FRAME) { if (cf->len > 64) err = -EINVAL; } else { if (cf->len > 8) err = -EINVAL; } if (err < 0) goto free_op; if (msg_head->flags & TX_CP_CAN_ID) { /* copy can_id into frame */ cf->can_id = msg_head->can_id; } } /* tx_ops never compare with previous received messages */ op->last_frames = NULL; /* bcm_can_tx / bcm_tx_timeout_handler needs this */ op->sk = sk; op->ifindex = ifindex; /* initialize uninitialized (kzalloc) structure */ hrtimer_setup(&op->timer, bcm_tx_timeout_handler, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT); /* currently unused in tx_ops */ hrtimer_setup(&op->thrtimer, hrtimer_dummy_timeout, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT); /* add this bcm_op to the list of the tx_ops */ list_add(&op->list, &bo->tx_ops); } /* if ((op = bcm_find_op(&bo->tx_ops, msg_head->can_id, ifindex))) */ if (op->flags & SETTIMER) { /* set timer values */ op->ival1 = msg_head->ival1; op->ival2 = msg_head->ival2; op->kt_ival1 = bcm_timeval_to_ktime(msg_head->ival1); op->kt_ival2 = bcm_timeval_to_ktime(msg_head->ival2); /* disable an active timer due to zero values? */ if (!op->kt_ival1 && !op->kt_ival2) hrtimer_cancel(&op->timer); } if (op->flags & STARTTIMER) { hrtimer_cancel(&op->timer); /* spec: send CAN frame when starting timer */ op->flags |= TX_ANNOUNCE; } if (op->flags & TX_ANNOUNCE) bcm_can_tx(op); if (op->flags & STARTTIMER) bcm_tx_start_timer(op); return msg_head->nframes * op->cfsiz + MHSIZ; free_op: if (op->frames != &op->sframe) kfree(op->frames); kfree(op); return err; } /* * bcm_rx_setup - create or update a bcm rx op (for bcm_sendmsg) */ static int bcm_rx_setup(struct bcm_msg_head *msg_head, struct msghdr *msg, int ifindex, struct sock *sk) { struct bcm_sock *bo = bcm_sk(sk); struct bcm_op *op; int do_rx_register; int err = 0; if ((msg_head->flags & RX_FILTER_ID) || (!(msg_head->nframes))) { /* be robust against wrong usage ... */ msg_head->flags |= RX_FILTER_ID; /* ignore trailing garbage */ msg_head->nframes = 0; } /* the first element contains the mux-mask => MAX_NFRAMES + 1 */ if (msg_head->nframes > MAX_NFRAMES + 1) return -EINVAL; if ((msg_head->flags & RX_RTR_FRAME) && ((msg_head->nframes != 1) || (!(msg_head->can_id & CAN_RTR_FLAG)))) return -EINVAL; /* check timeval limitations */ if ((msg_head->flags & SETTIMER) && bcm_is_invalid_tv(msg_head)) return -EINVAL; /* check the given can_id */ op = bcm_find_op(&bo->rx_ops, msg_head, ifindex); if (op) { /* update existing BCM operation */ /* * Do we need more space for the CAN frames than currently * allocated? -> This is a _really_ unusual use-case and * therefore (complexity / locking) it is not supported. */ if (msg_head->nframes > op->nframes) return -E2BIG; if (msg_head->nframes) { /* update CAN frames content */ err = memcpy_from_msg(op->frames, msg, msg_head->nframes * op->cfsiz); if (err < 0) return err; /* clear last_frames to indicate 'nothing received' */ memset(op->last_frames, 0, msg_head->nframes * op->cfsiz); } op->nframes = msg_head->nframes; op->flags = msg_head->flags; /* Only an update -> do not call can_rx_register() */ do_rx_register = 0; } else { /* insert new BCM operation for the given can_id */ op = kzalloc(OPSIZ, GFP_KERNEL); if (!op) return -ENOMEM; op->can_id = msg_head->can_id; op->nframes = msg_head->nframes; op->cfsiz = CFSIZ(msg_head->flags); op->flags = msg_head->flags; if (msg_head->nframes > 1) { /* create array for CAN frames and copy the data */ op->frames = kmalloc_array(msg_head->nframes, op->cfsiz, GFP_KERNEL); if (!op->frames) { kfree(op); return -ENOMEM; } /* create and init array for received CAN frames */ op->last_frames = kcalloc(msg_head->nframes, op->cfsiz, GFP_KERNEL); if (!op->last_frames) { kfree(op->frames); kfree(op); return -ENOMEM; } } else { op->frames = &op->sframe; op->last_frames = &op->last_sframe; } if (msg_head->nframes) { err = memcpy_from_msg(op->frames, msg, msg_head->nframes * op->cfsiz); if (err < 0) { if (op->frames != &op->sframe) kfree(op->frames); if (op->last_frames != &op->last_sframe) kfree(op->last_frames); kfree(op); return err; } } /* bcm_can_tx / bcm_tx_timeout_handler needs this */ op->sk = sk; op->ifindex = ifindex; /* ifindex for timeout events w/o previous frame reception */ op->rx_ifindex = ifindex; /* initialize uninitialized (kzalloc) structure */ hrtimer_setup(&op->timer, bcm_rx_timeout_handler, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT); hrtimer_setup(&op->thrtimer, bcm_rx_thr_handler, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT); /* add this bcm_op to the list of the rx_ops */ list_add(&op->list, &bo->rx_ops); /* call can_rx_register() */ do_rx_register = 1; } /* if ((op = bcm_find_op(&bo->rx_ops, msg_head->can_id, ifindex))) */ /* check flags */ if (op->flags & RX_RTR_FRAME) { struct canfd_frame *frame0 = op->frames; /* no timers in RTR-mode */ hrtimer_cancel(&op->thrtimer); hrtimer_cancel(&op->timer); /* * funny feature in RX(!)_SETUP only for RTR-mode: * copy can_id into frame BUT without RTR-flag to * prevent a full-load-loopback-test ... ;-] */ if ((op->flags & TX_CP_CAN_ID) || (frame0->can_id == op->can_id)) frame0->can_id = op->can_id & ~CAN_RTR_FLAG; } else { if (op->flags & SETTIMER) { /* set timer value */ op->ival1 = msg_head->ival1; op->ival2 = msg_head->ival2; op->kt_ival1 = bcm_timeval_to_ktime(msg_head->ival1); op->kt_ival2 = bcm_timeval_to_ktime(msg_head->ival2); /* disable an active timer due to zero value? */ if (!op->kt_ival1) hrtimer_cancel(&op->timer); /* * In any case cancel the throttle timer, flush * potentially blocked msgs and reset throttle handling */ op->kt_lastmsg = 0; hrtimer_cancel(&op->thrtimer); bcm_rx_thr_flush(op); } if ((op->flags & STARTTIMER) && op->kt_ival1) hrtimer_start(&op->timer, op->kt_ival1, HRTIMER_MODE_REL_SOFT); } /* now we can register for can_ids, if we added a new bcm_op */ if (do_rx_register) { if (ifindex) { struct net_device *dev; dev = dev_get_by_index(sock_net(sk), ifindex); if (dev) { err = can_rx_register(sock_net(sk), dev, op->can_id, REGMASK(op->can_id), bcm_rx_handler, op, "bcm", sk); op->rx_reg_dev = dev; dev_put(dev); } } else err = can_rx_register(sock_net(sk), NULL, op->can_id, REGMASK(op->can_id), bcm_rx_handler, op, "bcm", sk); if (err) { /* this bcm rx op is broken -> remove it */ list_del_rcu(&op->list); bcm_remove_op(op); return err; } } return msg_head->nframes * op->cfsiz + MHSIZ; } /* * bcm_tx_send - send a single CAN frame to the CAN interface (for bcm_sendmsg) */ static int bcm_tx_send(struct msghdr *msg, int ifindex, struct sock *sk, int cfsiz) { struct sk_buff *skb; struct net_device *dev; int err; /* we need a real device to send frames */ if (!ifindex) return -ENODEV; skb = alloc_skb(cfsiz + sizeof(struct can_skb_priv), GFP_KERNEL); if (!skb) return -ENOMEM; can_skb_reserve(skb); err = memcpy_from_msg(skb_put(skb, cfsiz), msg, cfsiz); if (err < 0) { kfree_skb(skb); return err; } dev = dev_get_by_index(sock_net(sk), ifindex); if (!dev) { kfree_skb(skb); return -ENODEV; } can_skb_prv(skb)->ifindex = dev->ifindex; can_skb_prv(skb)->skbcnt = 0; skb->dev = dev; can_skb_set_owner(skb, sk); err = can_send(skb, 1); /* send with loopback */ dev_put(dev); if (err) return err; return cfsiz + MHSIZ; } /* * bcm_sendmsg - process BCM commands (opcodes) from the userspace */ static int bcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) { struct sock *sk = sock->sk; struct bcm_sock *bo = bcm_sk(sk); int ifindex = bo->ifindex; /* default ifindex for this bcm_op */ struct bcm_msg_head msg_head; int cfsiz; int ret; /* read bytes or error codes as return value */ if (!bo->bound) return -ENOTCONN; /* check for valid message length from userspace */ if (size < MHSIZ) return -EINVAL; /* read message head information */ ret = memcpy_from_msg((u8 *)&msg_head, msg, MHSIZ); if (ret < 0) return ret; cfsiz = CFSIZ(msg_head.flags); if ((size - MHSIZ) % cfsiz) return -EINVAL; /* check for alternative ifindex for this bcm_op */ if (!ifindex && msg->msg_name) { /* no bound device as default => check msg_name */ DECLARE_SOCKADDR(struct sockaddr_can *, addr, msg->msg_name); if (msg->msg_namelen < BCM_MIN_NAMELEN) return -EINVAL; if (addr->can_family != AF_CAN) return -EINVAL; /* ifindex from sendto() */ ifindex = addr->can_ifindex; if (ifindex) { struct net_device *dev; dev = dev_get_by_index(sock_net(sk), ifindex); if (!dev) return -ENODEV; if (dev->type != ARPHRD_CAN) { dev_put(dev); return -ENODEV; } dev_put(dev); } } lock_sock(sk); switch (msg_head.opcode) { case TX_SETUP: ret = bcm_tx_setup(&msg_head, msg, ifindex, sk); break; case RX_SETUP: ret = bcm_rx_setup(&msg_head, msg, ifindex, sk); break; case TX_DELETE: if (bcm_delete_tx_op(&bo->tx_ops, &msg_head, ifindex)) ret = MHSIZ; else ret = -EINVAL; break; case RX_DELETE: if (bcm_delete_rx_op(&bo->rx_ops, &msg_head, ifindex)) ret = MHSIZ; else ret = -EINVAL; break; case TX_READ: /* reuse msg_head for the reply to TX_READ */ msg_head.opcode = TX_STATUS; ret = bcm_read_op(&bo->tx_ops, &msg_head, ifindex); break; case RX_READ: /* reuse msg_head for the reply to RX_READ */ msg_head.opcode = RX_STATUS; ret = bcm_read_op(&bo->rx_ops, &msg_head, ifindex); break; case TX_SEND: /* we need exactly one CAN frame behind the msg head */ if ((msg_head.nframes != 1) || (size != cfsiz + MHSIZ)) ret = -EINVAL; else ret = bcm_tx_send(msg, ifindex, sk, cfsiz); break; default: ret = -EINVAL; break; } release_sock(sk); return ret; } /* * notification handler for netdevice status changes */ static void bcm_notify(struct bcm_sock *bo, unsigned long msg, struct net_device *dev) { struct sock *sk = &bo->sk; struct bcm_op *op; int notify_enodev = 0; if (!net_eq(dev_net(dev), sock_net(sk))) return; switch (msg) { case NETDEV_UNREGISTER: lock_sock(sk); /* remove device specific receive entries */ list_for_each_entry(op, &bo->rx_ops, list) if (op->rx_reg_dev == dev) bcm_rx_unreg(dev, op); /* remove device reference, if this is our bound device */ if (bo->bound && bo->ifindex == dev->ifindex) { #if IS_ENABLED(CONFIG_PROC_FS) if (sock_net(sk)->can.bcmproc_dir && bo->bcm_proc_read) { remove_proc_entry(bo->procname, sock_net(sk)->can.bcmproc_dir); bo->bcm_proc_read = NULL; } #endif bo->bound = 0; bo->ifindex = 0; notify_enodev = 1; } release_sock(sk); if (notify_enodev) { sk->sk_err = ENODEV; if (!sock_flag(sk, SOCK_DEAD)) sk_error_report(sk); } break; case NETDEV_DOWN: if (bo->bound && bo->ifindex == dev->ifindex) { sk->sk_err = ENETDOWN; if (!sock_flag(sk, SOCK_DEAD)) sk_error_report(sk); } } } static int bcm_notifier(struct notifier_block *nb, unsigned long msg, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); if (dev->type != ARPHRD_CAN) return NOTIFY_DONE; if (msg != NETDEV_UNREGISTER && msg != NETDEV_DOWN) return NOTIFY_DONE; if (unlikely(bcm_busy_notifier)) /* Check for reentrant bug. */ return NOTIFY_DONE; spin_lock(&bcm_notifier_lock); list_for_each_entry(bcm_busy_notifier, &bcm_notifier_list, notifier) { spin_unlock(&bcm_notifier_lock); bcm_notify(bcm_busy_notifier, msg, dev); spin_lock(&bcm_notifier_lock); } bcm_busy_notifier = NULL; spin_unlock(&bcm_notifier_lock); return NOTIFY_DONE; } /* * initial settings for all BCM sockets to be set at socket creation time */ static int bcm_init(struct sock *sk) { struct bcm_sock *bo = bcm_sk(sk); bo->bound = 0; bo->ifindex = 0; bo->dropped_usr_msgs = 0; bo->bcm_proc_read = NULL; INIT_LIST_HEAD(&bo->tx_ops); INIT_LIST_HEAD(&bo->rx_ops); /* set notifier */ spin_lock(&bcm_notifier_lock); list_add_tail(&bo->notifier, &bcm_notifier_list); spin_unlock(&bcm_notifier_lock); return 0; } /* * standard socket functions */ static int bcm_release(struct socket *sock) { struct sock *sk = sock->sk; struct net *net; struct bcm_sock *bo; struct bcm_op *op, *next; if (!sk) return 0; net = sock_net(sk); bo = bcm_sk(sk); /* remove bcm_ops, timer, rx_unregister(), etc. */ spin_lock(&bcm_notifier_lock); while (bcm_busy_notifier == bo) { spin_unlock(&bcm_notifier_lock); schedule_timeout_uninterruptible(1); spin_lock(&bcm_notifier_lock); } list_del(&bo->notifier); spin_unlock(&bcm_notifier_lock); lock_sock(sk); #if IS_ENABLED(CONFIG_PROC_FS) /* remove procfs entry */ if (net->can.bcmproc_dir && bo->bcm_proc_read) remove_proc_entry(bo->procname, net->can.bcmproc_dir); #endif /* CONFIG_PROC_FS */ list_for_each_entry_safe(op, next, &bo->tx_ops, list) bcm_remove_op(op); list_for_each_entry_safe(op, next, &bo->rx_ops, list) { /* * Don't care if we're bound or not (due to netdev problems) * can_rx_unregister() is always a save thing to do here. */ if (op->ifindex) { /* * Only remove subscriptions that had not * been removed due to NETDEV_UNREGISTER * in bcm_notifier() */ if (op->rx_reg_dev) { struct net_device *dev; dev = dev_get_by_index(net, op->ifindex); if (dev) { bcm_rx_unreg(dev, op); dev_put(dev); } } } else can_rx_unregister(net, NULL, op->can_id, REGMASK(op->can_id), bcm_rx_handler, op); } synchronize_rcu(); list_for_each_entry_safe(op, next, &bo->rx_ops, list) bcm_remove_op(op); /* remove device reference */ if (bo->bound) { bo->bound = 0; bo->ifindex = 0; } sock_orphan(sk); sock->sk = NULL; release_sock(sk); sock_prot_inuse_add(net, sk->sk_prot, -1); sock_put(sk); return 0; } static int bcm_connect(struct socket *sock, struct sockaddr *uaddr, int len, int flags) { struct sockaddr_can *addr = (struct sockaddr_can *)uaddr; struct sock *sk = sock->sk; struct bcm_sock *bo = bcm_sk(sk); struct net *net = sock_net(sk); int ret = 0; if (len < BCM_MIN_NAMELEN) return -EINVAL; lock_sock(sk); if (bo->bound) { ret = -EISCONN; goto fail; } /* bind a device to this socket */ if (addr->can_ifindex) { struct net_device *dev; dev = dev_get_by_index(net, addr->can_ifindex); if (!dev) { ret = -ENODEV; goto fail; } if (dev->type != ARPHRD_CAN) { dev_put(dev); ret = -ENODEV; goto fail; } bo->ifindex = dev->ifindex; dev_put(dev); } else { /* no interface reference for ifindex = 0 ('any' CAN device) */ bo->ifindex = 0; } #if IS_ENABLED(CONFIG_PROC_FS) if (net->can.bcmproc_dir) { /* unique socket address as filename */ sprintf(bo->procname, "%lu", sock_i_ino(sk)); bo->bcm_proc_read = proc_create_net_single(bo->procname, 0644, net->can.bcmproc_dir, bcm_proc_show, sk); if (!bo->bcm_proc_read) { ret = -ENOMEM; goto fail; } } #endif /* CONFIG_PROC_FS */ bo->bound = 1; fail: release_sock(sk); return ret; } static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) { struct sock *sk = sock->sk; struct sk_buff *skb; int error = 0; int err; skb = skb_recv_datagram(sk, flags, &error); if (!skb) return error; if (skb->len < size) size = skb->len; err = memcpy_to_msg(msg, skb->data, size); if (err < 0) { skb_free_datagram(sk, skb); return err; } sock_recv_cmsgs(msg, sk, skb); if (msg->msg_name) { __sockaddr_check_size(BCM_MIN_NAMELEN); msg->msg_namelen = BCM_MIN_NAMELEN; memcpy(msg->msg_name, skb->cb, msg->msg_namelen); } /* assign the flags that have been recorded in bcm_send_to_user() */ msg->msg_flags |= *(bcm_flags(skb)); skb_free_datagram(sk, skb); return size; } static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd, unsigned long arg) { /* no ioctls for socket layer -> hand it down to NIC layer */ return -ENOIOCTLCMD; } static const struct proto_ops bcm_ops = { .family = PF_CAN, .release = bcm_release, .bind = sock_no_bind, .connect = bcm_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = sock_no_getname, .poll = datagram_poll, .ioctl = bcm_sock_no_ioctlcmd, .gettstamp = sock_gettstamp, .listen = sock_no_listen, .shutdown = sock_no_shutdown, .sendmsg = bcm_sendmsg, .recvmsg = bcm_recvmsg, .mmap = sock_no_mmap, }; static struct proto bcm_proto __read_mostly = { .name = "CAN_BCM", .owner = THIS_MODULE, .obj_size = sizeof(struct bcm_sock), .init = bcm_init, }; static const struct can_proto bcm_can_proto = { .type = SOCK_DGRAM, .protocol = CAN_BCM, .ops = &bcm_ops, .prot = &bcm_proto, }; static int canbcm_pernet_init(struct net *net) { #if IS_ENABLED(CONFIG_PROC_FS) /* create /proc/net/can-bcm directory */ net->can.bcmproc_dir = proc_net_mkdir(net, "can-bcm", net->proc_net); #endif /* CONFIG_PROC_FS */ return 0; } static void canbcm_pernet_exit(struct net *net) { #if IS_ENABLED(CONFIG_PROC_FS) /* remove /proc/net/can-bcm directory */ if (net->can.bcmproc_dir) remove_proc_entry("can-bcm", net->proc_net); #endif /* CONFIG_PROC_FS */ } static struct pernet_operations canbcm_pernet_ops __read_mostly = { .init = canbcm_pernet_init, .exit = canbcm_pernet_exit, }; static struct notifier_block canbcm_notifier = { .notifier_call = bcm_notifier }; static int __init bcm_module_init(void) { int err; pr_info("can: broadcast manager protocol\n"); err = register_pernet_subsys(&canbcm_pernet_ops); if (err) return err; err = register_netdevice_notifier(&canbcm_notifier); if (err) goto register_notifier_failed; err = can_proto_register(&bcm_can_proto); if (err < 0) { printk(KERN_ERR "can: registration of bcm protocol failed\n"); goto register_proto_failed; } return 0; register_proto_failed: unregister_netdevice_notifier(&canbcm_notifier); register_notifier_failed: unregister_pernet_subsys(&canbcm_pernet_ops); return err; } static void __exit bcm_module_exit(void) { can_proto_unregister(&bcm_can_proto); unregister_netdevice_notifier(&canbcm_notifier); unregister_pernet_subsys(&canbcm_pernet_ops); } module_init(bcm_module_init); module_exit(bcm_module_exit); |
| 10 10 10 10 2795 11254 11243 2795 11250 416 418 418 417 1124 1123 1123 1123 1125 1126 1120 1125 369 369 369 368 369 369 367 369 108 108 108 143 142 143 141 141 141 99 99 98 99 99 143 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 | // SPDX-License-Identifier: GPL-2.0 /* * Fast batching percpu counters. */ #include <linux/percpu_counter.h> #include <linux/mutex.h> #include <linux/init.h> #include <linux/cpu.h> #include <linux/module.h> #include <linux/debugobjects.h> #ifdef CONFIG_HOTPLUG_CPU static LIST_HEAD(percpu_counters); static DEFINE_SPINLOCK(percpu_counters_lock); #endif #ifdef CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER static const struct debug_obj_descr percpu_counter_debug_descr; static bool percpu_counter_fixup_free(void *addr, enum debug_obj_state state) { struct percpu_counter *fbc = addr; switch (state) { case ODEBUG_STATE_ACTIVE: percpu_counter_destroy(fbc); debug_object_free(fbc, &percpu_counter_debug_descr); return true; default: return false; } } static const struct debug_obj_descr percpu_counter_debug_descr = { .name = "percpu_counter", .fixup_free = percpu_counter_fixup_free, }; static inline void debug_percpu_counter_activate(struct percpu_counter *fbc) { debug_object_init(fbc, &percpu_counter_debug_descr); debug_object_activate(fbc, &percpu_counter_debug_descr); } static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc) { debug_object_deactivate(fbc, &percpu_counter_debug_descr); debug_object_free(fbc, &percpu_counter_debug_descr); } #else /* CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER */ static inline void debug_percpu_counter_activate(struct percpu_counter *fbc) { } static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc) { } #endif /* CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER */ void percpu_counter_set(struct percpu_counter *fbc, s64 amount) { int cpu; unsigned long flags; raw_spin_lock_irqsave(&fbc->lock, flags); for_each_possible_cpu(cpu) { s32 *pcount = per_cpu_ptr(fbc->counters, cpu); *pcount = 0; } fbc->count = amount; raw_spin_unlock_irqrestore(&fbc->lock, flags); } EXPORT_SYMBOL(percpu_counter_set); /* * Add to a counter while respecting batch size. * * There are 2 implementations, both dealing with the following problem: * * The decision slow path/fast path and the actual update must be atomic. * Otherwise a call in process context could check the current values and * decide that the fast path can be used. If now an interrupt occurs before * the this_cpu_add(), and the interrupt updates this_cpu(*fbc->counters), * then the this_cpu_add() that is executed after the interrupt has completed * can produce values larger than "batch" or even overflows. */ #ifdef CONFIG_HAVE_CMPXCHG_LOCAL /* * Safety against interrupts is achieved in 2 ways: * 1. the fast path uses local cmpxchg (note: no lock prefix) * 2. the slow path operates with interrupts disabled */ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch) { s64 count; unsigned long flags; count = this_cpu_read(*fbc->counters); do { if (unlikely(abs(count + amount) >= batch)) { raw_spin_lock_irqsave(&fbc->lock, flags); /* * Note: by now we might have migrated to another CPU * or the value might have changed. */ count = __this_cpu_read(*fbc->counters); fbc->count += count + amount; __this_cpu_sub(*fbc->counters, count); raw_spin_unlock_irqrestore(&fbc->lock, flags); return; } } while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount)); } #else /* * local_irq_save() is used to make the function irq safe: * - The slow path would be ok as protected by an irq-safe spinlock. * - this_cpu_add would be ok as it is irq-safe by definition. */ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch) { s64 count; unsigned long flags; local_irq_save(flags); count = __this_cpu_read(*fbc->counters) + amount; if (abs(count) >= batch) { raw_spin_lock(&fbc->lock); fbc->count += count; __this_cpu_sub(*fbc->counters, count - amount); raw_spin_unlock(&fbc->lock); } else { this_cpu_add(*fbc->counters, amount); } local_irq_restore(flags); } #endif EXPORT_SYMBOL(percpu_counter_add_batch); /* * For percpu_counter with a big batch, the devication of its count could * be big, and there is requirement to reduce the deviation, like when the * counter's batch could be runtime decreased to get a better accuracy, * which can be achieved by running this sync function on each CPU. */ void percpu_counter_sync(struct percpu_counter *fbc) { unsigned long flags; s64 count; raw_spin_lock_irqsave(&fbc->lock, flags); count = __this_cpu_read(*fbc->counters); fbc->count += count; __this_cpu_sub(*fbc->counters, count); raw_spin_unlock_irqrestore(&fbc->lock, flags); } EXPORT_SYMBOL(percpu_counter_sync); /* * Add up all the per-cpu counts, return the result. This is a more accurate * but much slower version of percpu_counter_read_positive(). * * We use the cpu mask of (cpu_online_mask | cpu_dying_mask) to capture sums * from CPUs that are in the process of being taken offline. Dying cpus have * been removed from the online mask, but may not have had the hotplug dead * notifier called to fold the percpu count back into the global counter sum. * By including dying CPUs in the iteration mask, we avoid this race condition * so __percpu_counter_sum() just does the right thing when CPUs are being taken * offline. */ s64 __percpu_counter_sum(struct percpu_counter *fbc) { s64 ret; int cpu; unsigned long flags; raw_spin_lock_irqsave(&fbc->lock, flags); ret = fbc->count; for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) { s32 *pcount = per_cpu_ptr(fbc->counters, cpu); ret += *pcount; } raw_spin_unlock_irqrestore(&fbc->lock, flags); return ret; } EXPORT_SYMBOL(__percpu_counter_sum); int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp, u32 nr_counters, struct lock_class_key *key) { unsigned long flags __maybe_unused; size_t counter_size; s32 __percpu *counters; u32 i; counter_size = ALIGN(sizeof(*counters), __alignof__(*counters)); counters = __alloc_percpu_gfp(nr_counters * counter_size, __alignof__(*counters), gfp); if (!counters) { fbc[0].counters = NULL; return -ENOMEM; } for (i = 0; i < nr_counters; i++) { raw_spin_lock_init(&fbc[i].lock); lockdep_set_class(&fbc[i].lock, key); #ifdef CONFIG_HOTPLUG_CPU INIT_LIST_HEAD(&fbc[i].list); #endif fbc[i].count = amount; fbc[i].counters = (void __percpu *)counters + i * counter_size; debug_percpu_counter_activate(&fbc[i]); } #ifdef CONFIG_HOTPLUG_CPU spin_lock_irqsave(&percpu_counters_lock, flags); for (i = 0; i < nr_counters; i++) list_add(&fbc[i].list, &percpu_counters); spin_unlock_irqrestore(&percpu_counters_lock, flags); #endif return 0; } EXPORT_SYMBOL(__percpu_counter_init_many); void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 nr_counters) { unsigned long flags __maybe_unused; u32 i; if (WARN_ON_ONCE(!fbc)) return; if (!fbc[0].counters) return; for (i = 0; i < nr_counters; i++) debug_percpu_counter_deactivate(&fbc[i]); #ifdef CONFIG_HOTPLUG_CPU spin_lock_irqsave(&percpu_counters_lock, flags); for (i = 0; i < nr_counters; i++) list_del(&fbc[i].list); spin_unlock_irqrestore(&percpu_counters_lock, flags); #endif free_percpu(fbc[0].counters); for (i = 0; i < nr_counters; i++) fbc[i].counters = NULL; } EXPORT_SYMBOL(percpu_counter_destroy_many); int percpu_counter_batch __read_mostly = 32; EXPORT_SYMBOL(percpu_counter_batch); static int compute_batch_value(unsigned int cpu) { int nr = num_online_cpus(); percpu_counter_batch = max(32, nr*2); return 0; } static int percpu_counter_cpu_dead(unsigned int cpu) { #ifdef CONFIG_HOTPLUG_CPU struct percpu_counter *fbc; compute_batch_value(cpu); spin_lock_irq(&percpu_counters_lock); list_for_each_entry(fbc, &percpu_counters, list) { s32 *pcount; raw_spin_lock(&fbc->lock); pcount = per_cpu_ptr(fbc->counters, cpu); fbc->count += *pcount; *pcount = 0; raw_spin_unlock(&fbc->lock); } spin_unlock_irq(&percpu_counters_lock); #endif return 0; } /* * Compare counter against given value. * Return 1 if greater, 0 if equal and -1 if less */ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch) { s64 count; count = percpu_counter_read(fbc); /* Check to see if rough count will be sufficient for comparison */ if (abs(count - rhs) > (batch * num_online_cpus())) { if (count > rhs) return 1; else return -1; } /* Need to use precise count */ count = percpu_counter_sum(fbc); if (count > rhs) return 1; else if (count < rhs) return -1; else return 0; } EXPORT_SYMBOL(__percpu_counter_compare); /* * Compare counter, and add amount if total is: less than or equal to limit if * amount is positive, or greater than or equal to limit if amount is negative. * Return true if amount is added, or false if total would be beyond the limit. * * Negative limit is allowed, but unusual. * When negative amounts (subs) are given to percpu_counter_limited_add(), * the limit would most naturally be 0 - but other limits are also allowed. * * Overflow beyond S64_MAX is not allowed for: counter, limit and amount * are all assumed to be sane (far from S64_MIN and S64_MAX). */ bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount, s32 batch) { s64 count; s64 unknown; unsigned long flags; bool good = false; if (amount == 0) return true; local_irq_save(flags); unknown = batch * num_online_cpus(); count = __this_cpu_read(*fbc->counters); /* Skip taking the lock when safe */ if (abs(count + amount) <= batch && ((amount > 0 && fbc->count + unknown <= limit) || (amount < 0 && fbc->count - unknown >= limit))) { this_cpu_add(*fbc->counters, amount); local_irq_restore(flags); return true; } raw_spin_lock(&fbc->lock); count = fbc->count + amount; /* Skip percpu_counter_sum() when safe */ if (amount > 0) { if (count - unknown > limit) goto out; if (count + unknown <= limit) good = true; } else { if (count + unknown < limit) goto out; if (count - unknown >= limit) good = true; } if (!good) { s32 *pcount; int cpu; for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) { pcount = per_cpu_ptr(fbc->counters, cpu); count += *pcount; } if (amount > 0) { if (count > limit) goto out; } else { if (count < limit) goto out; } good = true; } count = __this_cpu_read(*fbc->counters); fbc->count += count + amount; __this_cpu_sub(*fbc->counters, count); out: raw_spin_unlock(&fbc->lock); local_irq_restore(flags); return good; } static int __init percpu_counter_startup(void) { int ret; ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "lib/percpu_cnt:online", compute_batch_value, NULL); WARN_ON(ret < 0); ret = cpuhp_setup_state_nocalls(CPUHP_PERCPU_CNT_DEAD, "lib/percpu_cnt:dead", NULL, percpu_counter_cpu_dead); WARN_ON(ret < 0); return 0; } module_init(percpu_counter_startup); |
| 4 9 9 6 4 5 10 11 11 10 11 11 11 11 11 11 11 92 5 8 3 6 1 5 5 5 5 5 11 16 15 15 11 11 11 9 11 10 5 11 10 16 15 10 28 16 28 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300 5301 5302 5303 5304 5305 5306 5307 5308 5309 5310 5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 5434 5435 5436 5437 5438 5439 5440 5441 5442 5443 5444 5445 5446 5447 5448 5449 5450 5451 5452 5453 5454 5455 5456 5457 5458 5459 5460 5461 5462 5463 5464 5465 5466 5467 5468 5469 5470 5471 5472 5473 5474 5475 5476 5477 5478 5479 5480 5481 5482 5483 5484 5485 5486 5487 5488 5489 5490 5491 5492 5493 5494 5495 5496 5497 5498 5499 5500 5501 5502 5503 5504 5505 5506 5507 5508 5509 5510 5511 5512 5513 5514 5515 5516 5517 5518 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 5551 5552 5553 5554 5555 5556 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 5631 5632 5633 5634 5635 5636 5637 5638 5639 5640 5641 5642 5643 5644 5645 5646 5647 5648 5649 5650 5651 5652 5653 5654 5655 5656 5657 5658 5659 5660 5661 5662 5663 5664 5665 5666 5667 5668 5669 5670 5671 5672 5673 5674 5675 5676 5677 5678 5679 5680 5681 5682 5683 5684 5685 5686 5687 5688 5689 5690 5691 5692 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 5703 5704 5705 5706 5707 5708 5709 5710 5711 5712 5713 5714 5715 5716 5717 5718 5719 5720 5721 5722 5723 5724 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784 5785 5786 5787 5788 5789 5790 5791 5792 5793 5794 5795 5796 5797 5798 5799 5800 5801 5802 5803 5804 5805 5806 5807 5808 5809 5810 5811 5812 5813 5814 5815 5816 5817 5818 5819 5820 5821 5822 5823 5824 5825 5826 5827 5828 5829 5830 5831 5832 5833 5834 5835 5836 5837 5838 5839 5840 5841 5842 5843 5844 5845 5846 5847 5848 5849 5850 5851 5852 5853 5854 5855 5856 5857 5858 5859 5860 5861 5862 5863 5864 5865 5866 5867 5868 5869 5870 5871 5872 5873 5874 5875 5876 5877 5878 5879 5880 5881 5882 5883 5884 5885 5886 5887 5888 5889 5890 5891 5892 5893 5894 5895 5896 5897 5898 5899 5900 5901 5902 5903 5904 5905 5906 5907 5908 5909 5910 5911 5912 5913 5914 5915 5916 5917 5918 5919 5920 5921 5922 5923 5924 5925 5926 5927 5928 5929 5930 5931 5932 5933 5934 5935 5936 5937 5938 5939 5940 5941 5942 5943 5944 5945 5946 5947 5948 5949 5950 5951 5952 5953 5954 5955 5956 5957 5958 5959 5960 5961 5962 5963 5964 5965 5966 5967 5968 5969 5970 5971 5972 5973 5974 5975 5976 5977 5978 5979 5980 5981 5982 5983 5984 5985 5986 5987 5988 5989 5990 5991 5992 5993 5994 5995 5996 5997 5998 5999 6000 6001 6002 6003 6004 6005 6006 6007 6008 6009 6010 6011 6012 6013 6014 6015 6016 6017 6018 6019 6020 6021 6022 6023 6024 6025 6026 6027 6028 6029 6030 6031 6032 6033 6034 6035 6036 6037 6038 6039 6040 6041 6042 6043 6044 6045 6046 6047 6048 6049 6050 6051 6052 6053 6054 6055 6056 6057 6058 6059 6060 6061 6062 6063 6064 6065 6066 6067 6068 6069 6070 6071 6072 6073 6074 6075 6076 6077 6078 6079 6080 6081 6082 6083 6084 6085 6086 6087 6088 6089 6090 6091 6092 6093 6094 6095 6096 6097 6098 6099 6100 6101 6102 6103 6104 6105 6106 6107 6108 6109 6110 6111 6112 6113 6114 6115 6116 6117 6118 6119 6120 6121 6122 6123 6124 6125 6126 6127 6128 6129 6130 6131 6132 6133 6134 6135 6136 6137 6138 6139 6140 6141 6142 6143 6144 6145 6146 6147 6148 6149 6150 6151 6152 6153 6154 6155 6156 6157 6158 6159 6160 6161 6162 6163 6164 6165 6166 6167 6168 6169 6170 6171 6172 6173 6174 6175 6176 6177 6178 6179 6180 6181 6182 6183 6184 6185 6186 6187 6188 6189 6190 6191 6192 6193 6194 6195 6196 6197 6198 6199 6200 6201 6202 6203 6204 6205 6206 6207 6208 6209 6210 6211 6212 6213 6214 6215 6216 6217 6218 6219 6220 6221 6222 6223 6224 6225 6226 6227 6228 6229 6230 6231 6232 6233 6234 6235 6236 6237 6238 6239 6240 6241 6242 6243 6244 6245 6246 6247 6248 6249 6250 6251 6252 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 6263 6264 6265 6266 6267 6268 6269 6270 6271 6272 6273 6274 6275 6276 6277 6278 6279 6280 6281 6282 6283 6284 6285 6286 6287 6288 6289 6290 6291 6292 6293 6294 6295 6296 6297 6298 6299 6300 6301 6302 6303 6304 6305 6306 6307 6308 6309 6310 6311 6312 6313 6314 6315 6316 6317 6318 6319 6320 6321 6322 6323 6324 6325 6326 6327 6328 6329 6330 6331 6332 6333 6334 6335 6336 6337 6338 6339 6340 6341 6342 6343 6344 6345 6346 6347 6348 6349 6350 6351 6352 6353 6354 6355 6356 6357 6358 6359 6360 6361 6362 6363 6364 6365 6366 6367 6368 6369 6370 6371 6372 6373 6374 6375 6376 6377 6378 6379 6380 6381 6382 6383 6384 6385 6386 6387 6388 6389 6390 6391 6392 6393 6394 6395 6396 6397 6398 6399 6400 6401 6402 6403 6404 6405 6406 6407 6408 6409 6410 6411 6412 6413 6414 6415 6416 6417 6418 6419 6420 6421 6422 6423 6424 6425 6426 6427 6428 6429 6430 6431 6432 6433 6434 6435 6436 6437 6438 6439 6440 6441 6442 6443 6444 6445 6446 6447 6448 6449 6450 6451 6452 6453 6454 6455 6456 6457 6458 6459 6460 6461 6462 6463 6464 6465 6466 6467 6468 6469 6470 6471 6472 6473 6474 6475 6476 6477 6478 6479 6480 6481 6482 6483 6484 6485 6486 6487 6488 6489 6490 6491 6492 6493 6494 6495 6496 6497 6498 6499 6500 6501 6502 6503 6504 6505 6506 6507 6508 6509 6510 6511 6512 6513 6514 6515 6516 6517 6518 6519 6520 6521 6522 6523 6524 6525 6526 6527 6528 6529 6530 6531 6532 6533 6534 6535 6536 6537 6538 6539 6540 6541 6542 6543 6544 6545 6546 6547 6548 6549 6550 6551 6552 6553 6554 6555 6556 6557 6558 6559 6560 6561 6562 6563 6564 6565 6566 6567 6568 6569 6570 6571 6572 6573 6574 6575 6576 6577 6578 6579 6580 6581 6582 6583 6584 6585 6586 6587 6588 6589 6590 6591 6592 6593 6594 6595 6596 6597 6598 6599 6600 6601 6602 6603 6604 6605 6606 6607 6608 6609 6610 6611 6612 6613 6614 6615 6616 6617 6618 6619 6620 6621 6622 6623 6624 6625 6626 6627 6628 6629 6630 6631 6632 6633 6634 6635 6636 6637 6638 6639 6640 6641 6642 6643 6644 6645 6646 6647 6648 6649 6650 6651 6652 6653 6654 6655 6656 6657 6658 6659 6660 6661 6662 6663 6664 6665 6666 6667 6668 6669 6670 6671 6672 6673 6674 6675 6676 6677 6678 6679 6680 6681 6682 6683 6684 6685 6686 6687 6688 6689 6690 6691 6692 6693 6694 6695 6696 6697 6698 6699 6700 6701 6702 6703 6704 6705 6706 6707 6708 6709 6710 6711 6712 6713 6714 6715 6716 6717 6718 6719 6720 6721 6722 6723 6724 6725 6726 6727 6728 6729 6730 6731 6732 6733 6734 6735 6736 6737 6738 6739 6740 6741 6742 6743 6744 6745 6746 6747 6748 6749 6750 6751 6752 6753 6754 6755 6756 6757 6758 6759 6760 6761 6762 6763 6764 6765 6766 6767 6768 6769 6770 6771 6772 6773 6774 6775 6776 6777 6778 6779 6780 6781 6782 6783 6784 6785 6786 6787 6788 6789 6790 6791 6792 6793 6794 6795 6796 6797 6798 6799 6800 6801 6802 6803 6804 6805 6806 6807 6808 6809 6810 6811 6812 6813 6814 6815 6816 6817 6818 6819 6820 6821 6822 6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 6833 6834 6835 6836 6837 6838 6839 6840 6841 6842 6843 6844 6845 6846 6847 6848 6849 6850 6851 6852 6853 6854 6855 6856 6857 6858 6859 6860 6861 6862 6863 6864 6865 6866 6867 6868 6869 6870 6871 6872 6873 6874 6875 6876 6877 6878 6879 6880 6881 6882 6883 6884 6885 6886 6887 6888 6889 6890 6891 6892 6893 6894 6895 6896 6897 6898 6899 6900 6901 6902 6903 6904 6905 6906 6907 6908 6909 6910 6911 6912 6913 6914 6915 6916 6917 6918 6919 6920 6921 6922 6923 6924 6925 6926 6927 6928 6929 6930 6931 6932 6933 6934 6935 6936 6937 6938 6939 6940 6941 6942 6943 6944 6945 6946 6947 6948 6949 6950 6951 6952 6953 6954 6955 6956 6957 6958 6959 6960 6961 6962 6963 6964 6965 6966 6967 6968 6969 6970 6971 6972 6973 6974 6975 6976 6977 6978 6979 6980 6981 6982 6983 6984 6985 6986 6987 6988 6989 6990 6991 6992 6993 6994 6995 6996 6997 6998 6999 7000 7001 7002 7003 7004 7005 7006 7007 7008 7009 7010 7011 7012 7013 7014 7015 7016 7017 7018 7019 7020 7021 7022 7023 7024 7025 7026 7027 7028 7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042 7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056 7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070 7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170 7171 7172 7173 7174 7175 7176 7177 7178 7179 7180 7181 7182 7183 7184 7185 7186 7187 7188 7189 7190 7191 7192 7193 7194 7195 7196 7197 7198 7199 7200 7201 7202 7203 7204 7205 7206 7207 7208 7209 7210 7211 7212 7213 7214 7215 7216 7217 7218 7219 7220 7221 7222 7223 7224 7225 7226 7227 7228 7229 7230 7231 7232 7233 7234 7235 7236 7237 7238 7239 7240 7241 7242 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 7253 7254 7255 7256 7257 7258 7259 7260 7261 7262 7263 7264 7265 7266 7267 7268 7269 7270 7271 7272 7273 7274 7275 7276 7277 7278 7279 7280 7281 7282 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 7293 7294 7295 7296 7297 7298 7299 7300 7301 7302 7303 7304 7305 7306 7307 7308 7309 7310 7311 7312 7313 7314 7315 7316 7317 7318 7319 7320 7321 7322 7323 7324 7325 7326 7327 7328 7329 7330 7331 7332 7333 7334 7335 7336 7337 7338 7339 7340 7341 7342 7343 7344 7345 7346 7347 7348 7349 7350 7351 7352 7353 7354 7355 7356 7357 7358 7359 7360 7361 7362 7363 7364 7365 7366 7367 7368 7369 7370 7371 7372 7373 7374 7375 7376 7377 7378 7379 7380 7381 7382 7383 7384 7385 7386 7387 7388 7389 7390 7391 7392 7393 7394 7395 7396 7397 7398 7399 7400 7401 7402 7403 7404 7405 7406 7407 7408 7409 7410 7411 7412 7413 7414 7415 7416 7417 7418 7419 7420 7421 7422 7423 7424 7425 7426 7427 7428 7429 7430 7431 7432 7433 7434 7435 7436 7437 7438 7439 7440 7441 7442 7443 7444 7445 7446 7447 7448 7449 7450 7451 7452 7453 7454 7455 7456 7457 7458 7459 7460 7461 7462 7463 7464 7465 7466 7467 7468 7469 7470 7471 7472 7473 7474 7475 7476 7477 7478 7479 7480 7481 7482 7483 7484 7485 7486 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 7526 7527 7528 7529 7530 7531 7532 7533 7534 7535 7536 7537 7538 7539 7540 7541 7542 7543 7544 7545 7546 7547 7548 7549 7550 7551 7552 7553 7554 7555 7556 7557 7558 7559 7560 7561 7562 7563 7564 7565 7566 7567 7568 7569 7570 7571 7572 7573 7574 7575 7576 7577 7578 7579 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 7600 7601 7602 7603 7604 7605 7606 7607 7608 7609 7610 7611 7612 7613 7614 7615 7616 7617 7618 7619 7620 7621 7622 7623 7624 7625 7626 7627 7628 7629 7630 7631 7632 7633 7634 7635 7636 7637 7638 7639 7640 7641 7642 7643 7644 7645 7646 7647 7648 7649 7650 7651 7652 7653 7654 7655 7656 7657 7658 7659 7660 7661 7662 7663 7664 7665 7666 7667 7668 7669 7670 7671 7672 7673 7674 7675 7676 7677 7678 7679 7680 7681 7682 7683 7684 7685 7686 7687 7688 7689 7690 7691 7692 7693 7694 7695 7696 7697 7698 7699 7700 7701 7702 7703 7704 7705 7706 7707 7708 7709 7710 7711 7712 7713 7714 7715 7716 7717 7718 7719 7720 7721 7722 7723 7724 7725 7726 7727 7728 7729 7730 7731 7732 7733 7734 7735 7736 7737 7738 7739 7740 7741 7742 7743 7744 7745 7746 7747 7748 7749 7750 7751 7752 7753 7754 7755 7756 7757 7758 7759 7760 7761 7762 7763 7764 7765 7766 7767 7768 7769 7770 7771 7772 7773 7774 7775 7776 7777 7778 7779 7780 7781 7782 7783 7784 7785 7786 7787 7788 7789 7790 7791 7792 7793 7794 7795 7796 7797 7798 7799 7800 7801 7802 7803 7804 7805 7806 7807 7808 7809 7810 7811 7812 7813 7814 7815 7816 7817 7818 7819 7820 7821 7822 7823 7824 7825 7826 7827 7828 7829 7830 7831 7832 7833 7834 7835 7836 7837 7838 7839 7840 7841 7842 7843 7844 7845 7846 7847 7848 7849 7850 7851 7852 7853 7854 7855 7856 7857 7858 7859 7860 7861 7862 7863 7864 7865 7866 7867 7868 7869 7870 7871 7872 7873 7874 7875 7876 7877 7878 7879 7880 7881 7882 7883 7884 7885 7886 7887 7888 7889 7890 7891 7892 7893 7894 7895 7896 7897 7898 7899 7900 7901 7902 7903 7904 7905 7906 7907 7908 7909 7910 7911 7912 7913 7914 7915 7916 7917 7918 7919 7920 7921 7922 7923 7924 7925 7926 7927 7928 7929 7930 7931 7932 7933 7934 7935 7936 7937 7938 7939 7940 7941 7942 7943 7944 7945 7946 7947 7948 7949 7950 7951 7952 7953 7954 7955 7956 7957 7958 7959 7960 7961 7962 7963 7964 7965 7966 7967 7968 7969 7970 7971 7972 7973 7974 7975 7976 7977 7978 7979 7980 7981 7982 7983 7984 7985 7986 7987 7988 7989 7990 7991 7992 7993 7994 7995 7996 7997 7998 7999 8000 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010 8011 8012 8013 8014 8015 8016 8017 8018 8019 8020 8021 8022 8023 8024 8025 8026 8027 8028 8029 8030 8031 8032 8033 8034 8035 8036 8037 8038 8039 8040 8041 8042 8043 8044 8045 8046 8047 8048 8049 8050 8051 8052 8053 8054 8055 8056 8057 8058 8059 8060 8061 8062 8063 8064 8065 8066 8067 8068 8069 8070 8071 8072 8073 8074 8075 8076 8077 8078 8079 8080 8081 8082 8083 8084 8085 8086 8087 8088 8089 8090 8091 8092 8093 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103 8104 8105 8106 8107 8108 8109 8110 8111 8112 8113 8114 8115 8116 8117 8118 8119 8120 8121 8122 8123 8124 8125 8126 8127 8128 8129 8130 8131 8132 8133 8134 8135 8136 8137 8138 8139 8140 8141 8142 8143 8144 8145 8146 8147 8148 8149 8150 8151 8152 8153 8154 8155 8156 8157 8158 8159 8160 8161 8162 8163 8164 8165 8166 8167 8168 8169 8170 8171 8172 8173 8174 8175 8176 8177 8178 8179 8180 8181 8182 8183 8184 8185 8186 8187 8188 8189 8190 8191 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8204 8205 8206 8207 8208 8209 8210 8211 8212 8213 8214 8215 8216 8217 8218 8219 8220 8221 8222 8223 8224 8225 8226 8227 8228 8229 8230 8231 8232 8233 8234 8235 8236 8237 8238 8239 8240 8241 8242 8243 8244 8245 8246 8247 8248 8249 8250 8251 8252 8253 8254 8255 8256 8257 8258 8259 8260 8261 8262 8263 8264 8265 8266 8267 8268 8269 8270 8271 8272 8273 8274 8275 8276 8277 8278 8279 8280 8281 8282 8283 8284 8285 8286 8287 8288 8289 8290 8291 8292 8293 8294 8295 8296 8297 8298 8299 8300 8301 8302 8303 8304 8305 8306 8307 8308 8309 8310 8311 8312 8313 8314 8315 8316 8317 8318 8319 8320 8321 8322 8323 8324 8325 8326 8327 8328 8329 8330 8331 8332 8333 8334 8335 8336 8337 8338 8339 8340 8341 8342 8343 8344 8345 8346 8347 8348 8349 8350 8351 8352 8353 8354 8355 8356 8357 8358 8359 8360 8361 8362 8363 8364 8365 8366 8367 8368 8369 8370 8371 8372 8373 8374 8375 8376 8377 8378 8379 8380 8381 8382 8383 8384 8385 8386 8387 8388 8389 8390 8391 8392 8393 8394 8395 8396 8397 8398 8399 8400 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410 8411 8412 8413 8414 8415 8416 8417 8418 8419 8420 8421 8422 8423 8424 8425 8426 8427 8428 8429 8430 8431 8432 8433 8434 8435 8436 8437 8438 8439 8440 8441 8442 8443 8444 8445 8446 8447 8448 8449 8450 8451 8452 8453 8454 8455 8456 8457 8458 8459 8460 8461 8462 8463 8464 8465 8466 8467 8468 8469 8470 8471 8472 8473 8474 8475 8476 8477 8478 8479 8480 8481 8482 8483 8484 8485 8486 8487 8488 8489 8490 8491 8492 8493 8494 8495 8496 8497 8498 8499 8500 8501 8502 8503 8504 8505 8506 8507 8508 8509 8510 8511 8512 8513 8514 8515 8516 8517 8518 8519 8520 8521 8522 8523 8524 8525 8526 8527 8528 8529 8530 8531 8532 8533 8534 8535 8536 8537 8538 8539 8540 8541 8542 8543 8544 8545 8546 8547 8548 8549 8550 8551 8552 8553 8554 8555 8556 8557 8558 8559 8560 8561 8562 8563 8564 8565 8566 8567 8568 8569 8570 8571 8572 8573 8574 8575 8576 8577 8578 8579 8580 8581 8582 8583 8584 8585 8586 8587 8588 8589 8590 8591 8592 8593 8594 8595 8596 8597 8598 8599 8600 8601 8602 8603 8604 8605 8606 8607 8608 8609 8610 8611 8612 8613 8614 8615 8616 8617 8618 8619 8620 8621 8622 8623 8624 8625 8626 8627 8628 8629 8630 8631 8632 8633 8634 8635 8636 8637 8638 8639 8640 8641 8642 8643 8644 8645 8646 8647 8648 8649 8650 8651 8652 8653 8654 8655 8656 8657 8658 8659 8660 8661 8662 8663 8664 8665 8666 8667 8668 8669 8670 8671 8672 8673 8674 8675 8676 8677 8678 8679 8680 8681 8682 8683 8684 8685 8686 8687 8688 8689 8690 8691 8692 8693 8694 8695 8696 8697 8698 8699 8700 8701 8702 8703 8704 8705 8706 8707 8708 8709 8710 8711 8712 8713 8714 8715 8716 8717 8718 8719 8720 8721 8722 8723 8724 8725 8726 8727 8728 8729 8730 8731 8732 8733 8734 8735 8736 8737 8738 8739 8740 8741 8742 8743 8744 8745 8746 8747 8748 8749 8750 8751 8752 8753 8754 8755 8756 8757 8758 8759 8760 8761 8762 8763 8764 8765 8766 8767 8768 8769 8770 8771 8772 8773 8774 8775 8776 8777 8778 8779 8780 8781 8782 8783 8784 8785 8786 8787 8788 8789 8790 8791 8792 8793 8794 8795 8796 8797 8798 8799 8800 8801 8802 8803 8804 8805 8806 8807 8808 8809 8810 8811 8812 8813 8814 8815 8816 8817 8818 8819 8820 8821 8822 8823 8824 8825 8826 8827 8828 8829 8830 8831 8832 8833 8834 8835 8836 8837 8838 8839 8840 8841 8842 8843 8844 8845 8846 8847 8848 8849 8850 8851 8852 8853 8854 8855 8856 8857 8858 8859 8860 8861 8862 8863 8864 8865 8866 8867 8868 8869 8870 8871 8872 8873 8874 8875 8876 8877 8878 8879 8880 8881 8882 8883 8884 8885 8886 8887 8888 8889 8890 8891 8892 8893 8894 8895 8896 8897 8898 8899 8900 8901 8902 8903 8904 8905 8906 8907 8908 8909 8910 8911 8912 8913 8914 8915 8916 8917 8918 8919 8920 8921 8922 8923 8924 8925 8926 8927 8928 8929 8930 8931 8932 8933 8934 8935 8936 8937 8938 8939 8940 8941 8942 8943 8944 8945 8946 8947 8948 8949 8950 8951 8952 8953 8954 8955 8956 8957 8958 8959 8960 8961 8962 8963 8964 8965 8966 8967 8968 8969 8970 8971 8972 8973 8974 8975 8976 8977 8978 8979 8980 8981 8982 8983 8984 8985 8986 8987 8988 8989 8990 8991 8992 8993 8994 8995 8996 8997 8998 8999 9000 9001 9002 9003 9004 9005 9006 9007 9008 9009 9010 9011 9012 9013 9014 9015 9016 9017 9018 9019 9020 9021 9022 9023 9024 9025 9026 9027 9028 9029 9030 9031 9032 9033 9034 9035 9036 9037 9038 9039 9040 9041 9042 9043 9044 9045 9046 9047 9048 9049 9050 9051 9052 9053 9054 9055 9056 9057 9058 9059 9060 9061 9062 9063 9064 9065 9066 9067 9068 9069 9070 9071 9072 9073 9074 9075 9076 9077 9078 9079 9080 9081 9082 9083 9084 9085 9086 9087 9088 9089 9090 9091 9092 9093 9094 9095 9096 9097 9098 9099 9100 9101 9102 9103 9104 9105 9106 9107 9108 9109 9110 9111 9112 9113 9114 9115 9116 9117 9118 9119 9120 9121 9122 9123 9124 9125 9126 9127 9128 9129 9130 9131 9132 9133 9134 9135 9136 9137 9138 9139 9140 9141 9142 9143 9144 9145 9146 9147 9148 9149 9150 9151 9152 9153 9154 9155 9156 9157 9158 9159 9160 9161 9162 9163 9164 9165 9166 9167 9168 9169 9170 9171 9172 9173 9174 9175 9176 9177 9178 9179 9180 9181 9182 9183 9184 9185 9186 9187 9188 9189 9190 9191 9192 9193 9194 9195 9196 9197 9198 9199 9200 9201 9202 9203 9204 9205 9206 9207 9208 9209 9210 9211 9212 9213 9214 9215 9216 9217 9218 9219 9220 9221 9222 9223 9224 9225 9226 9227 9228 9229 9230 9231 9232 9233 9234 9235 9236 9237 9238 9239 9240 9241 9242 9243 9244 9245 9246 9247 9248 9249 9250 9251 9252 9253 9254 9255 9256 9257 9258 9259 9260 9261 9262 9263 9264 9265 9266 9267 9268 9269 9270 9271 9272 9273 9274 9275 9276 9277 9278 9279 9280 9281 9282 9283 9284 9285 9286 9287 9288 9289 9290 9291 9292 9293 9294 9295 9296 9297 9298 9299 9300 9301 9302 9303 9304 9305 9306 9307 9308 9309 9310 9311 9312 9313 9314 9315 9316 9317 9318 9319 9320 9321 9322 9323 9324 9325 9326 9327 9328 9329 9330 9331 9332 9333 9334 9335 9336 9337 9338 9339 9340 9341 9342 9343 9344 9345 9346 9347 9348 9349 9350 9351 9352 9353 9354 9355 9356 9357 9358 9359 9360 9361 9362 9363 9364 9365 9366 9367 9368 9369 9370 9371 9372 9373 9374 9375 9376 9377 9378 9379 9380 9381 9382 9383 9384 9385 9386 9387 9388 9389 9390 9391 9392 9393 9394 9395 9396 9397 9398 9399 9400 9401 9402 9403 9404 9405 9406 9407 9408 9409 9410 9411 9412 9413 9414 9415 9416 9417 9418 9419 9420 9421 9422 9423 9424 9425 9426 9427 9428 9429 9430 9431 9432 9433 9434 9435 9436 9437 9438 9439 9440 9441 9442 9443 9444 9445 9446 9447 9448 9449 9450 9451 9452 9453 9454 9455 9456 9457 9458 9459 9460 9461 9462 9463 9464 9465 9466 9467 9468 9469 9470 9471 9472 9473 9474 9475 9476 9477 9478 9479 9480 9481 9482 9483 9484 9485 9486 9487 9488 9489 9490 9491 9492 9493 9494 9495 9496 9497 9498 9499 9500 9501 9502 9503 9504 9505 9506 9507 9508 9509 9510 9511 9512 9513 9514 9515 9516 9517 9518 9519 9520 9521 9522 9523 9524 9525 9526 9527 9528 9529 9530 9531 9532 9533 9534 9535 9536 9537 9538 9539 9540 9541 9542 9543 9544 9545 9546 9547 9548 9549 9550 9551 9552 9553 9554 9555 9556 9557 9558 9559 9560 9561 9562 9563 9564 9565 9566 9567 9568 9569 9570 9571 9572 9573 9574 9575 9576 9577 9578 9579 | // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2018 Facebook */ #include <uapi/linux/btf.h> #include <uapi/linux/bpf.h> #include <uapi/linux/bpf_perf_event.h> #include <uapi/linux/types.h> #include <linux/seq_file.h> #include <linux/compiler.h> #include <linux/ctype.h> #include <linux/errno.h> #include <linux/slab.h> #include <linux/anon_inodes.h> #include <linux/file.h> #include <linux/uaccess.h> #include <linux/kernel.h> #include <linux/idr.h> #include <linux/sort.h> #include <linux/bpf_verifier.h> #include <linux/btf.h> #include <linux/btf_ids.h> #include <linux/bpf.h> #include <linux/bpf_lsm.h> #include <linux/skmsg.h> #include <linux/perf_event.h> #include <linux/bsearch.h> #include <linux/kobject.h> #include <linux/sysfs.h> #include <linux/overflow.h> #include <net/netfilter/nf_bpf_link.h> #include <net/sock.h> #include <net/xdp.h> #include "../tools/lib/bpf/relo_core.h" /* BTF (BPF Type Format) is the meta data format which describes * the data types of BPF program/map. Hence, it basically focus * on the C programming language which the modern BPF is primary * using. * * ELF Section: * ~~~~~~~~~~~ * The BTF data is stored under the ".BTF" ELF section * * struct btf_type: * ~~~~~~~~~~~~~~~ * Each 'struct btf_type' object describes a C data type. * Depending on the type it is describing, a 'struct btf_type' * object may be followed by more data. F.e. * To describe an array, 'struct btf_type' is followed by * 'struct btf_array'. * * 'struct btf_type' and any extra data following it are * 4 bytes aligned. * * Type section: * ~~~~~~~~~~~~~ * The BTF type section contains a list of 'struct btf_type' objects. * Each one describes a C type. Recall from the above section * that a 'struct btf_type' object could be immediately followed by extra * data in order to describe some particular C types. * * type_id: * ~~~~~~~ * Each btf_type object is identified by a type_id. The type_id * is implicitly implied by the location of the btf_type object in * the BTF type section. The first one has type_id 1. The second * one has type_id 2...etc. Hence, an earlier btf_type has * a smaller type_id. * * A btf_type object may refer to another btf_type object by using * type_id (i.e. the "type" in the "struct btf_type"). * * NOTE that we cannot assume any reference-order. * A btf_type object can refer to an earlier btf_type object * but it can also refer to a later btf_type object. * * For example, to describe "const void *". A btf_type * object describing "const" may refer to another btf_type * object describing "void *". This type-reference is done * by specifying type_id: * * [1] CONST (anon) type_id=2 * [2] PTR (anon) type_id=0 * * The above is the btf_verifier debug log: * - Each line started with "[?]" is a btf_type object * - [?] is the type_id of the btf_type object. * - CONST/PTR is the BTF_KIND_XXX * - "(anon)" is the name of the type. It just * happens that CONST and PTR has no name. * - type_id=XXX is the 'u32 type' in btf_type * * NOTE: "void" has type_id 0 * * String section: * ~~~~~~~~~~~~~~ * The BTF string section contains the names used by the type section. * Each string is referred by an "offset" from the beginning of the * string section. * * Each string is '\0' terminated. * * The first character in the string section must be '\0' * which is used to mean 'anonymous'. Some btf_type may not * have a name. */ /* BTF verification: * * To verify BTF data, two passes are needed. * * Pass #1 * ~~~~~~~ * The first pass is to collect all btf_type objects to * an array: "btf->types". * * Depending on the C type that a btf_type is describing, * a btf_type may be followed by extra data. We don't know * how many btf_type is there, and more importantly we don't * know where each btf_type is located in the type section. * * Without knowing the location of each type_id, most verifications * cannot be done. e.g. an earlier btf_type may refer to a later * btf_type (recall the "const void *" above), so we cannot * check this type-reference in the first pass. * * In the first pass, it still does some verifications (e.g. * checking the name is a valid offset to the string section). * * Pass #2 * ~~~~~~~ * The main focus is to resolve a btf_type that is referring * to another type. * * We have to ensure the referring type: * 1) does exist in the BTF (i.e. in btf->types[]) * 2) does not cause a loop: * struct A { * struct B b; * }; * * struct B { * struct A a; * }; * * btf_type_needs_resolve() decides if a btf_type needs * to be resolved. * * The needs_resolve type implements the "resolve()" ops which * essentially does a DFS and detects backedge. * * During resolve (or DFS), different C types have different * "RESOLVED" conditions. * * When resolving a BTF_KIND_STRUCT, we need to resolve all its * members because a member is always referring to another * type. A struct's member can be treated as "RESOLVED" if * it is referring to a BTF_KIND_PTR. Otherwise, the * following valid C struct would be rejected: * * struct A { * int m; * struct A *a; * }; * * When resolving a BTF_KIND_PTR, it needs to keep resolving if * it is referring to another BTF_KIND_PTR. Otherwise, we cannot * detect a pointer loop, e.g.: * BTF_KIND_CONST -> BTF_KIND_PTR -> BTF_KIND_CONST -> BTF_KIND_PTR + * ^ | * +-----------------------------------------+ * */ #define BITS_PER_U128 (sizeof(u64) * BITS_PER_BYTE * 2) #define BITS_PER_BYTE_MASK (BITS_PER_BYTE - 1) #define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK) #define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3) #define BITS_ROUNDUP_BYTES(bits) \ (BITS_ROUNDDOWN_BYTES(bits) + !!BITS_PER_BYTE_MASKED(bits)) #define BTF_INFO_MASK 0x9f00ffff #define BTF_INT_MASK 0x0fffffff #define BTF_TYPE_ID_VALID(type_id) ((type_id) <= BTF_MAX_TYPE) #define BTF_STR_OFFSET_VALID(name_off) ((name_off) <= BTF_MAX_NAME_OFFSET) /* 16MB for 64k structs and each has 16 members and * a few MB spaces for the string section. * The hard limit is S32_MAX. */ #define BTF_MAX_SIZE (16 * 1024 * 1024) #define for_each_member_from(i, from, struct_type, member) \ for (i = from, member = btf_type_member(struct_type) + from; \ i < btf_type_vlen(struct_type); \ i++, member++) #define for_each_vsi_from(i, from, struct_type, member) \ for (i = from, member = btf_type_var_secinfo(struct_type) + from; \ i < btf_type_vlen(struct_type); \ i++, member++) DEFINE_IDR(btf_idr); DEFINE_SPINLOCK(btf_idr_lock); enum btf_kfunc_hook { BTF_KFUNC_HOOK_COMMON, BTF_KFUNC_HOOK_XDP, BTF_KFUNC_HOOK_TC, BTF_KFUNC_HOOK_STRUCT_OPS, BTF_KFUNC_HOOK_TRACING, BTF_KFUNC_HOOK_SYSCALL, BTF_KFUNC_HOOK_FMODRET, BTF_KFUNC_HOOK_CGROUP, BTF_KFUNC_HOOK_SCHED_ACT, BTF_KFUNC_HOOK_SK_SKB, BTF_KFUNC_HOOK_SOCKET_FILTER, BTF_KFUNC_HOOK_LWT, BTF_KFUNC_HOOK_NETFILTER, BTF_KFUNC_HOOK_KPROBE, BTF_KFUNC_HOOK_MAX, }; enum { BTF_KFUNC_SET_MAX_CNT = 256, BTF_DTOR_KFUNC_MAX_CNT = 256, BTF_KFUNC_FILTER_MAX_CNT = 16, }; struct btf_kfunc_hook_filter { btf_kfunc_filter_t filters[BTF_KFUNC_FILTER_MAX_CNT]; u32 nr_filters; }; struct btf_kfunc_set_tab { struct btf_id_set8 *sets[BTF_KFUNC_HOOK_MAX]; struct btf_kfunc_hook_filter hook_filters[BTF_KFUNC_HOOK_MAX]; }; struct btf_id_dtor_kfunc_tab { u32 cnt; struct btf_id_dtor_kfunc dtors[]; }; struct btf_struct_ops_tab { u32 cnt; u32 capacity; struct bpf_struct_ops_desc ops[]; }; struct btf { void *data; struct btf_type **types; u32 *resolved_ids; u32 *resolved_sizes; const char *strings; void *nohdr_data; struct btf_header hdr; u32 nr_types; /* includes VOID for base BTF */ u32 types_size; u32 data_size; refcount_t refcnt; u32 id; struct rcu_head rcu; struct btf_kfunc_set_tab *kfunc_set_tab; struct btf_id_dtor_kfunc_tab *dtor_kfunc_tab; struct btf_struct_metas *struct_meta_tab; struct btf_struct_ops_tab *struct_ops_tab; /* split BTF support */ struct btf *base_btf; u32 start_id; /* first type ID in this BTF (0 for base BTF) */ u32 start_str_off; /* first string offset (0 for base BTF) */ char name[MODULE_NAME_LEN]; bool kernel_btf; __u32 *base_id_map; /* map from distilled base BTF -> vmlinux BTF ids */ }; enum verifier_phase { CHECK_META, CHECK_TYPE, }; struct resolve_vertex { const struct btf_type *t; u32 type_id; u16 next_member; }; enum visit_state { NOT_VISITED, VISITED, RESOLVED, }; enum resolve_mode { RESOLVE_TBD, /* To Be Determined */ RESOLVE_PTR, /* Resolving for Pointer */ RESOLVE_STRUCT_OR_ARRAY, /* Resolving for struct/union * or array */ }; #define MAX_RESOLVE_DEPTH 32 struct btf_sec_info { u32 off; u32 len; }; struct btf_verifier_env { struct btf *btf; u8 *visit_states; struct resolve_vertex stack[MAX_RESOLVE_DEPTH]; struct bpf_verifier_log log; u32 log_type_id; u32 top_stack; enum verifier_phase phase; enum resolve_mode resolve_mode; }; static const char * const btf_kind_str[NR_BTF_KINDS] = { [BTF_KIND_UNKN] = "UNKNOWN", [BTF_KIND_INT] = "INT", [BTF_KIND_PTR] = "PTR", [BTF_KIND_ARRAY] = "ARRAY", [BTF_KIND_STRUCT] = "STRUCT", [BTF_KIND_UNION] = "UNION", [BTF_KIND_ENUM] = "ENUM", [BTF_KIND_FWD] = "FWD", [BTF_KIND_TYPEDEF] = "TYPEDEF", [BTF_KIND_VOLATILE] = "VOLATILE", [BTF_KIND_CONST] = "CONST", [BTF_KIND_RESTRICT] = "RESTRICT", [BTF_KIND_FUNC] = "FUNC", [BTF_KIND_FUNC_PROTO] = "FUNC_PROTO", [BTF_KIND_VAR] = "VAR", [BTF_KIND_DATASEC] = "DATASEC", [BTF_KIND_FLOAT] = "FLOAT", [BTF_KIND_DECL_TAG] = "DECL_TAG", [BTF_KIND_TYPE_TAG] = "TYPE_TAG", [BTF_KIND_ENUM64] = "ENUM64", }; const char *btf_type_str(const struct btf_type *t) { return btf_kind_str[BTF_INFO_KIND(t->info)]; } /* Chunk size we use in safe copy of data to be shown. */ #define BTF_SHOW_OBJ_SAFE_SIZE 32 /* * This is the maximum size of a base type value (equivalent to a * 128-bit int); if we are at the end of our safe buffer and have * less than 16 bytes space we can't be assured of being able * to copy the next type safely, so in such cases we will initiate * a new copy. */ #define BTF_SHOW_OBJ_BASE_TYPE_SIZE 16 /* Type name size */ #define BTF_SHOW_NAME_SIZE 80 /* * The suffix of a type that indicates it cannot alias another type when * comparing BTF IDs for kfunc invocations. */ #define NOCAST_ALIAS_SUFFIX "___init" /* * Common data to all BTF show operations. Private show functions can add * their own data to a structure containing a struct btf_show and consult it * in the show callback. See btf_type_show() below. * * One challenge with showing nested data is we want to skip 0-valued * data, but in order to figure out whether a nested object is all zeros * we need to walk through it. As a result, we need to make two passes * when handling structs, unions and arrays; the first path simply looks * for nonzero data, while the second actually does the display. The first * pass is signalled by show->state.depth_check being set, and if we * encounter a non-zero value we set show->state.depth_to_show to * the depth at which we encountered it. When we have completed the * first pass, we will know if anything needs to be displayed if * depth_to_show > depth. See btf_[struct,array]_show() for the * implementation of this. * * Another problem is we want to ensure the data for display is safe to * access. To support this, the anonymous "struct {} obj" tracks the data * object and our safe copy of it. We copy portions of the data needed * to the object "copy" buffer, but because its size is limited to * BTF_SHOW_OBJ_COPY_LEN bytes, multiple copies may be required as we * traverse larger objects for display. * * The various data type show functions all start with a call to * btf_show_start_type() which returns a pointer to the safe copy * of the data needed (or if BTF_SHOW_UNSAFE is specified, to the * raw data itself). btf_show_obj_safe() is responsible for * using copy_from_kernel_nofault() to update the safe data if necessary * as we traverse the object's data. skbuff-like semantics are * used: * * - obj.head points to the start of the toplevel object for display * - obj.size is the size of the toplevel object * - obj.data points to the current point in the original data at * which our safe data starts. obj.data will advance as we copy * portions of the data. * * In most cases a single copy will suffice, but larger data structures * such as "struct task_struct" will require many copies. The logic in * btf_show_obj_safe() handles the logic that determines if a new * copy_from_kernel_nofault() is needed. */ struct btf_show { u64 flags; void *target; /* target of show operation (seq file, buffer) */ __printf(2, 0) void (*showfn)(struct btf_show *show, const char *fmt, va_list args); const struct btf *btf; /* below are used during iteration */ struct { u8 depth; u8 depth_to_show; u8 depth_check; u8 array_member:1, array_terminated:1; u16 array_encoding; u32 type_id; int status; /* non-zero for error */ const struct btf_type *type; const struct btf_member *member; char name[BTF_SHOW_NAME_SIZE]; /* space for member name/type */ } state; struct { u32 size; void *head; void *data; u8 safe[BTF_SHOW_OBJ_SAFE_SIZE]; } obj; }; struct btf_kind_operations { s32 (*check_meta)(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left); int (*resolve)(struct btf_verifier_env *env, const struct resolve_vertex *v); int (*check_member)(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type); int (*check_kflag_member)(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type); void (*log_details)(struct btf_verifier_env *env, const struct btf_type *t); void (*show)(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offsets, struct btf_show *show); }; static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS]; static struct btf_type btf_void; static int btf_resolve(struct btf_verifier_env *env, const struct btf_type *t, u32 type_id); static int btf_func_check(struct btf_verifier_env *env, const struct btf_type *t); static bool btf_type_is_modifier(const struct btf_type *t) { /* Some of them is not strictly a C modifier * but they are grouped into the same bucket * for BTF concern: * A type (t) that refers to another * type through t->type AND its size cannot * be determined without following the t->type. * * ptr does not fall into this bucket * because its size is always sizeof(void *). */ switch (BTF_INFO_KIND(t->info)) { case BTF_KIND_TYPEDEF: case BTF_KIND_VOLATILE: case BTF_KIND_CONST: case BTF_KIND_RESTRICT: case BTF_KIND_TYPE_TAG: return true; } return false; } bool btf_type_is_void(const struct btf_type *t) { return t == &btf_void; } static bool btf_type_is_datasec(const struct btf_type *t) { return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC; } static bool btf_type_is_decl_tag(const struct btf_type *t) { return BTF_INFO_KIND(t->info) == BTF_KIND_DECL_TAG; } static bool btf_type_nosize(const struct btf_type *t) { return btf_type_is_void(t) || btf_type_is_fwd(t) || btf_type_is_func(t) || btf_type_is_func_proto(t) || btf_type_is_decl_tag(t); } static bool btf_type_nosize_or_null(const struct btf_type *t) { return !t || btf_type_nosize(t); } static bool btf_type_is_decl_tag_target(const struct btf_type *t) { return btf_type_is_func(t) || btf_type_is_struct(t) || btf_type_is_var(t) || btf_type_is_typedef(t); } bool btf_is_vmlinux(const struct btf *btf) { return btf->kernel_btf && !btf->base_btf; } u32 btf_nr_types(const struct btf *btf) { u32 total = 0; while (btf) { total += btf->nr_types; btf = btf->base_btf; } return total; } s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind) { const struct btf_type *t; const char *tname; u32 i, total; total = btf_nr_types(btf); for (i = 1; i < total; i++) { t = btf_type_by_id(btf, i); if (BTF_INFO_KIND(t->info) != kind) continue; tname = btf_name_by_offset(btf, t->name_off); if (!strcmp(tname, name)) return i; } return -ENOENT; } s32 bpf_find_btf_id(const char *name, u32 kind, struct btf **btf_p) { struct btf *btf; s32 ret; int id; btf = bpf_get_btf_vmlinux(); if (IS_ERR(btf)) return PTR_ERR(btf); if (!btf) return -EINVAL; ret = btf_find_by_name_kind(btf, name, kind); /* ret is never zero, since btf_find_by_name_kind returns * positive btf_id or negative error. */ if (ret > 0) { btf_get(btf); *btf_p = btf; return ret; } /* If name is not found in vmlinux's BTF then search in module's BTFs */ spin_lock_bh(&btf_idr_lock); idr_for_each_entry(&btf_idr, btf, id) { if (!btf_is_module(btf)) continue; /* linear search could be slow hence unlock/lock * the IDR to avoiding holding it for too long */ btf_get(btf); spin_unlock_bh(&btf_idr_lock); ret = btf_find_by_name_kind(btf, name, kind); if (ret > 0) { *btf_p = btf; return ret; } btf_put(btf); spin_lock_bh(&btf_idr_lock); } spin_unlock_bh(&btf_idr_lock); return ret; } EXPORT_SYMBOL_GPL(bpf_find_btf_id); const struct btf_type *btf_type_skip_modifiers(const struct btf *btf, u32 id, u32 *res_id) { const struct btf_type *t = btf_type_by_id(btf, id); while (btf_type_is_modifier(t)) { id = t->type; t = btf_type_by_id(btf, t->type); } if (res_id) *res_id = id; return t; } const struct btf_type *btf_type_resolve_ptr(const struct btf *btf, u32 id, u32 *res_id) { const struct btf_type *t; t = btf_type_skip_modifiers(btf, id, NULL); if (!btf_type_is_ptr(t)) return NULL; return btf_type_skip_modifiers(btf, t->type, res_id); } const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf, u32 id, u32 *res_id) { const struct btf_type *ptype; ptype = btf_type_resolve_ptr(btf, id, res_id); if (ptype && btf_type_is_func_proto(ptype)) return ptype; return NULL; } /* Types that act only as a source, not sink or intermediate * type when resolving. */ static bool btf_type_is_resolve_source_only(const struct btf_type *t) { return btf_type_is_var(t) || btf_type_is_decl_tag(t) || btf_type_is_datasec(t); } /* What types need to be resolved? * * btf_type_is_modifier() is an obvious one. * * btf_type_is_struct() because its member refers to * another type (through member->type). * * btf_type_is_var() because the variable refers to * another type. btf_type_is_datasec() holds multiple * btf_type_is_var() types that need resolving. * * btf_type_is_array() because its element (array->type) * refers to another type. Array can be thought of a * special case of struct while array just has the same * member-type repeated by array->nelems of times. */ static bool btf_type_needs_resolve(const struct btf_type *t) { return btf_type_is_modifier(t) || btf_type_is_ptr(t) || btf_type_is_struct(t) || btf_type_is_array(t) || btf_type_is_var(t) || btf_type_is_func(t) || btf_type_is_decl_tag(t) || btf_type_is_datasec(t); } /* t->size can be used */ static bool btf_type_has_size(const struct btf_type *t) { switch (BTF_INFO_KIND(t->info)) { case BTF_KIND_INT: case BTF_KIND_STRUCT: case BTF_KIND_UNION: case BTF_KIND_ENUM: case BTF_KIND_DATASEC: case BTF_KIND_FLOAT: case BTF_KIND_ENUM64: return true; } return false; } static const char *btf_int_encoding_str(u8 encoding) { if (encoding == 0) return "(none)"; else if (encoding == BTF_INT_SIGNED) return "SIGNED"; else if (encoding == BTF_INT_CHAR) return "CHAR"; else if (encoding == BTF_INT_BOOL) return "BOOL"; else return "UNKN"; } static u32 btf_type_int(const struct btf_type *t) { return *(u32 *)(t + 1); } static const struct btf_array *btf_type_array(const struct btf_type *t) { return (const struct btf_array *)(t + 1); } static const struct btf_enum *btf_type_enum(const struct btf_type *t) { return (const struct btf_enum *)(t + 1); } static const struct btf_var *btf_type_var(const struct btf_type *t) { return (const struct btf_var *)(t + 1); } static const struct btf_decl_tag *btf_type_decl_tag(const struct btf_type *t) { return (const struct btf_decl_tag *)(t + 1); } static const struct btf_enum64 *btf_type_enum64(const struct btf_type *t) { return (const struct btf_enum64 *)(t + 1); } static const struct btf_kind_operations *btf_type_ops(const struct btf_type *t) { return kind_ops[BTF_INFO_KIND(t->info)]; } static bool btf_name_offset_valid(const struct btf *btf, u32 offset) { if (!BTF_STR_OFFSET_VALID(offset)) return false; while (offset < btf->start_str_off) btf = btf->base_btf; offset -= btf->start_str_off; return offset < btf->hdr.str_len; } static bool __btf_name_char_ok(char c, bool first) { if ((first ? !isalpha(c) : !isalnum(c)) && c != '_' && c != '.') return false; return true; } const char *btf_str_by_offset(const struct btf *btf, u32 offset) { while (offset < btf->start_str_off) btf = btf->base_btf; offset -= btf->start_str_off; if (offset < btf->hdr.str_len) return &btf->strings[offset]; return NULL; } static bool btf_name_valid_identifier(const struct btf *btf, u32 offset) { /* offset must be valid */ const char *src = btf_str_by_offset(btf, offset); const char *src_limit; if (!__btf_name_char_ok(*src, true)) return false; /* set a limit on identifier length */ src_limit = src + KSYM_NAME_LEN; src++; while (*src && src < src_limit) { if (!__btf_name_char_ok(*src, false)) return false; src++; } return !*src; } /* Allow any printable character in DATASEC names */ static bool btf_name_valid_section(const struct btf *btf, u32 offset) { /* offset must be valid */ const char *src = btf_str_by_offset(btf, offset); const char *src_limit; if (!*src) return false; /* set a limit on identifier length */ src_limit = src + KSYM_NAME_LEN; while (*src && src < src_limit) { if (!isprint(*src)) return false; src++; } return !*src; } static const char *__btf_name_by_offset(const struct btf *btf, u32 offset) { const char *name; if (!offset) return "(anon)"; name = btf_str_by_offset(btf, offset); return name ?: "(invalid-name-offset)"; } const char *btf_name_by_offset(const struct btf *btf, u32 offset) { return btf_str_by_offset(btf, offset); } const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) { while (type_id < btf->start_id) btf = btf->base_btf; type_id -= btf->start_id; if (type_id >= btf->nr_types) return NULL; return btf->types[type_id]; } EXPORT_SYMBOL_GPL(btf_type_by_id); /* * Check that the type @t is a regular int. This means that @t is not * a bit field and it has the same size as either of u8/u16/u32/u64 * or __int128. If @expected_size is not zero, then size of @t should * be the same. A caller should already have checked that the type @t * is an integer. */ static bool __btf_type_int_is_regular(const struct btf_type *t, size_t expected_size) { u32 int_data = btf_type_int(t); u8 nr_bits = BTF_INT_BITS(int_data); u8 nr_bytes = BITS_ROUNDUP_BYTES(nr_bits); return BITS_PER_BYTE_MASKED(nr_bits) == 0 && BTF_INT_OFFSET(int_data) == 0 && (nr_bytes <= 16 && is_power_of_2(nr_bytes)) && (expected_size == 0 || nr_bytes == expected_size); } static bool btf_type_int_is_regular(const struct btf_type *t) { return __btf_type_int_is_regular(t, 0); } bool btf_type_is_i32(const struct btf_type *t) { return btf_type_is_int(t) && __btf_type_int_is_regular(t, 4); } bool btf_type_is_i64(const struct btf_type *t) { return btf_type_is_int(t) && __btf_type_int_is_regular(t, 8); } bool btf_type_is_primitive(const struct btf_type *t) { return (btf_type_is_int(t) && btf_type_int_is_regular(t)) || btf_is_any_enum(t); } /* * Check that given struct member is a regular int with expected * offset and size. */ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, const struct btf_member *m, u32 expected_offset, u32 expected_size) { const struct btf_type *t; u32 id, int_data; u8 nr_bits; id = m->type; t = btf_type_id_size(btf, &id, NULL); if (!t || !btf_type_is_int(t)) return false; int_data = btf_type_int(t); nr_bits = BTF_INT_BITS(int_data); if (btf_type_kflag(s)) { u32 bitfield_size = BTF_MEMBER_BITFIELD_SIZE(m->offset); u32 bit_offset = BTF_MEMBER_BIT_OFFSET(m->offset); /* if kflag set, int should be a regular int and * bit offset should be at byte boundary. */ return !bitfield_size && BITS_ROUNDUP_BYTES(bit_offset) == expected_offset && BITS_ROUNDUP_BYTES(nr_bits) == expected_size; } if (BTF_INT_OFFSET(int_data) || BITS_PER_BYTE_MASKED(m->offset) || BITS_ROUNDUP_BYTES(m->offset) != expected_offset || BITS_PER_BYTE_MASKED(nr_bits) || BITS_ROUNDUP_BYTES(nr_bits) != expected_size) return false; return true; } /* Similar to btf_type_skip_modifiers() but does not skip typedefs. */ static const struct btf_type *btf_type_skip_qualifiers(const struct btf *btf, u32 id) { const struct btf_type *t = btf_type_by_id(btf, id); while (btf_type_is_modifier(t) && BTF_INFO_KIND(t->info) != BTF_KIND_TYPEDEF) { t = btf_type_by_id(btf, t->type); } return t; } #define BTF_SHOW_MAX_ITER 10 #define BTF_KIND_BIT(kind) (1ULL << kind) /* * Populate show->state.name with type name information. * Format of type name is * * [.member_name = ] (type_name) */ static const char *btf_show_name(struct btf_show *show) { /* BTF_MAX_ITER array suffixes "[]" */ const char *array_suffixes = "[][][][][][][][][][]"; const char *array_suffix = &array_suffixes[strlen(array_suffixes)]; /* BTF_MAX_ITER pointer suffixes "*" */ const char *ptr_suffixes = "**********"; const char *ptr_suffix = &ptr_suffixes[strlen(ptr_suffixes)]; const char *name = NULL, *prefix = "", *parens = ""; const struct btf_member *m = show->state.member; const struct btf_type *t; const struct btf_array *array; u32 id = show->state.type_id; const char *member = NULL; bool show_member = false; u64 kinds = 0; int i; show->state.name[0] = '\0'; /* * Don't show type name if we're showing an array member; * in that case we show the array type so don't need to repeat * ourselves for each member. */ if (show->state.array_member) return ""; /* Retrieve member name, if any. */ if (m) { member = btf_name_by_offset(show->btf, m->name_off); show_member = strlen(member) > 0; id = m->type; } /* * Start with type_id, as we have resolved the struct btf_type * * via btf_modifier_show() past the parent typedef to the child * struct, int etc it is defined as. In such cases, the type_id * still represents the starting type while the struct btf_type * * in our show->state points at the resolved type of the typedef. */ t = btf_type_by_id(show->btf, id); if (!t) return ""; /* * The goal here is to build up the right number of pointer and * array suffixes while ensuring the type name for a typedef * is represented. Along the way we accumulate a list of * BTF kinds we have encountered, since these will inform later * display; for example, pointer types will not require an * opening "{" for struct, we will just display the pointer value. * * We also want to accumulate the right number of pointer or array * indices in the format string while iterating until we get to * the typedef/pointee/array member target type. * * We start by pointing at the end of pointer and array suffix * strings; as we accumulate pointers and arrays we move the pointer * or array string backwards so it will show the expected number of * '*' or '[]' for the type. BTF_SHOW_MAX_ITER of nesting of pointers * and/or arrays and typedefs are supported as a precaution. * * We also want to get typedef name while proceeding to resolve * type it points to so that we can add parentheses if it is a * "typedef struct" etc. */ for (i = 0; i < BTF_SHOW_MAX_ITER; i++) { switch (BTF_INFO_KIND(t->info)) { case BTF_KIND_TYPEDEF: if (!name) name = btf_name_by_offset(show->btf, t->name_off); kinds |= BTF_KIND_BIT(BTF_KIND_TYPEDEF); id = t->type; break; case BTF_KIND_ARRAY: kinds |= BTF_KIND_BIT(BTF_KIND_ARRAY); parens = "["; if (!t) return ""; array = btf_type_array(t); if (array_suffix > array_suffixes) array_suffix -= 2; id = array->type; break; case BTF_KIND_PTR: kinds |= BTF_KIND_BIT(BTF_KIND_PTR); if (ptr_suffix > ptr_suffixes) ptr_suffix -= 1; id = t->type; break; default: id = 0; break; } if (!id) break; t = btf_type_skip_qualifiers(show->btf, id); } /* We may not be able to represent this type; bail to be safe */ if (i == BTF_SHOW_MAX_ITER) return ""; if (!name) name = btf_name_by_offset(show->btf, t->name_off); switch (BTF_INFO_KIND(t->info)) { case BTF_KIND_STRUCT: case BTF_KIND_UNION: prefix = BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT ? "struct" : "union"; /* if it's an array of struct/union, parens is already set */ if (!(kinds & (BTF_KIND_BIT(BTF_KIND_ARRAY)))) parens = "{"; break; case BTF_KIND_ENUM: case BTF_KIND_ENUM64: prefix = "enum"; break; default: break; } /* pointer does not require parens */ if (kinds & BTF_KIND_BIT(BTF_KIND_PTR)) parens = ""; /* typedef does not require struct/union/enum prefix */ if (kinds & BTF_KIND_BIT(BTF_KIND_TYPEDEF)) prefix = ""; if (!name) name = ""; /* Even if we don't want type name info, we want parentheses etc */ if (show->flags & BTF_SHOW_NONAME) snprintf(show->state.name, sizeof(show->state.name), "%s", parens); else snprintf(show->state.name, sizeof(show->state.name), "%s%s%s(%s%s%s%s%s%s)%s", /* first 3 strings comprise ".member = " */ show_member ? "." : "", show_member ? member : "", show_member ? " = " : "", /* ...next is our prefix (struct, enum, etc) */ prefix, strlen(prefix) > 0 && strlen(name) > 0 ? " " : "", /* ...this is the type name itself */ name, /* ...suffixed by the appropriate '*', '[]' suffixes */ strlen(ptr_suffix) > 0 ? " " : "", ptr_suffix, array_suffix, parens); return show->state.name; } static const char *__btf_show_indent(struct btf_show *show) { const char *indents = " "; const char *indent = &indents[strlen(indents)]; if ((indent - show->state.depth) >= indents) return indent - show->state.depth; return indents; } static const char *btf_show_indent(struct btf_show *show) { return show->flags & BTF_SHOW_COMPACT ? "" : __btf_show_indent(show); } static const char *btf_show_newline(struct btf_show *show) { return show->flags & BTF_SHOW_COMPACT ? "" : "\n"; } static const char *btf_show_delim(struct btf_show *show) { if (show->state.depth == 0) return ""; if ((show->flags & BTF_SHOW_COMPACT) && show->state.type && BTF_INFO_KIND(show->state.type->info) == BTF_KIND_UNION) return "|"; return ","; } __printf(2, 3) static void btf_show(struct btf_show *show, const char *fmt, ...) { va_list args; if (!show->state.depth_check) { va_start(args, fmt); show->showfn(show, fmt, args); va_end(args); } } /* Macros are used here as btf_show_type_value[s]() prepends and appends * format specifiers to the format specifier passed in; these do the work of * adding indentation, delimiters etc while the caller simply has to specify * the type value(s) in the format specifier + value(s). */ #define btf_show_type_value(show, fmt, value) \ do { \ if ((value) != (__typeof__(value))0 || \ (show->flags & BTF_SHOW_ZERO) || \ show->state.depth == 0) { \ btf_show(show, "%s%s" fmt "%s%s", \ btf_show_indent(show), \ btf_show_name(show), \ value, btf_show_delim(show), \ btf_show_newline(show)); \ if (show->state.depth > show->state.depth_to_show) \ show->state.depth_to_show = show->state.depth; \ } \ } while (0) #define btf_show_type_values(show, fmt, ...) \ do { \ btf_show(show, "%s%s" fmt "%s%s", btf_show_indent(show), \ btf_show_name(show), \ __VA_ARGS__, btf_show_delim(show), \ btf_show_newline(show)); \ if (show->state.depth > show->state.depth_to_show) \ show->state.depth_to_show = show->state.depth; \ } while (0) /* How much is left to copy to safe buffer after @data? */ static int btf_show_obj_size_left(struct btf_show *show, void *data) { return show->obj.head + show->obj.size - data; } /* Is object pointed to by @data of @size already copied to our safe buffer? */ static bool btf_show_obj_is_safe(struct btf_show *show, void *data, int size) { return data >= show->obj.data && (data + size) < (show->obj.data + BTF_SHOW_OBJ_SAFE_SIZE); } /* * If object pointed to by @data of @size falls within our safe buffer, return * the equivalent pointer to the same safe data. Assumes * copy_from_kernel_nofault() has already happened and our safe buffer is * populated. */ static void *__btf_show_obj_safe(struct btf_show *show, void *data, int size) { if (btf_show_obj_is_safe(show, data, size)) return show->obj.safe + (data - show->obj.data); return NULL; } /* * Return a safe-to-access version of data pointed to by @data. * We do this by copying the relevant amount of information * to the struct btf_show obj.safe buffer using copy_from_kernel_nofault(). * * If BTF_SHOW_UNSAFE is specified, just return data as-is; no * safe copy is needed. * * Otherwise we need to determine if we have the required amount * of data (determined by the @data pointer and the size of the * largest base type we can encounter (represented by * BTF_SHOW_OBJ_BASE_TYPE_SIZE). Having that much data ensures * that we will be able to print some of the current object, * and if more is needed a copy will be triggered. * Some objects such as structs will not fit into the buffer; * in such cases additional copies when we iterate over their * members may be needed. * * btf_show_obj_safe() is used to return a safe buffer for * btf_show_start_type(); this ensures that as we recurse into * nested types we always have safe data for the given type. * This approach is somewhat wasteful; it's possible for example * that when iterating over a large union we'll end up copying the * same data repeatedly, but the goal is safety not performance. * We use stack data as opposed to per-CPU buffers because the * iteration over a type can take some time, and preemption handling * would greatly complicate use of the safe buffer. */ static void *btf_show_obj_safe(struct btf_show *show, const struct btf_type *t, void *data) { const struct btf_type *rt; int size_left, size; void *safe = NULL; if (show->flags & BTF_SHOW_UNSAFE) return data; rt = btf_resolve_size(show->btf, t, &size); if (IS_ERR(rt)) { show->state.status = PTR_ERR(rt); return NULL; } /* * Is this toplevel object? If so, set total object size and * initialize pointers. Otherwise check if we still fall within * our safe object data. */ if (show->state.depth == 0) { show->obj.size = size; show->obj.head = data; } else { /* * If the size of the current object is > our remaining * safe buffer we _may_ need to do a new copy. However * consider the case of a nested struct; it's size pushes * us over the safe buffer limit, but showing any individual * struct members does not. In such cases, we don't need * to initiate a fresh copy yet; however we definitely need * at least BTF_SHOW_OBJ_BASE_TYPE_SIZE bytes left * in our buffer, regardless of the current object size. * The logic here is that as we resolve types we will * hit a base type at some point, and we need to be sure * the next chunk of data is safely available to display * that type info safely. We cannot rely on the size of * the current object here because it may be much larger * than our current buffer (e.g. task_struct is 8k). * All we want to do here is ensure that we can print the * next basic type, which we can if either * - the current type size is within the safe buffer; or * - at least BTF_SHOW_OBJ_BASE_TYPE_SIZE bytes are left in * the safe buffer. */ safe = __btf_show_obj_safe(show, data, min(size, BTF_SHOW_OBJ_BASE_TYPE_SIZE)); } /* * We need a new copy to our safe object, either because we haven't * yet copied and are initializing safe data, or because the data * we want falls outside the boundaries of the safe object. */ if (!safe) { size_left = btf_show_obj_size_left(show, data); if (size_left > BTF_SHOW_OBJ_SAFE_SIZE) size_left = BTF_SHOW_OBJ_SAFE_SIZE; show->state.status = copy_from_kernel_nofault(show->obj.safe, data, size_left); if (!show->state.status) { show->obj.data = data; safe = show->obj.safe; } } return safe; } /* * Set the type we are starting to show and return a safe data pointer * to be used for showing the associated data. */ static void *btf_show_start_type(struct btf_show *show, const struct btf_type *t, u32 type_id, void *data) { show->state.type = t; show->state.type_id = type_id; show->state.name[0] = '\0'; return btf_show_obj_safe(show, t, data); } static void btf_show_end_type(struct btf_show *show) { show->state.type = NULL; show->state.type_id = 0; show->state.name[0] = '\0'; } static void *btf_show_start_aggr_type(struct btf_show *show, const struct btf_type *t, u32 type_id, void *data) { void *safe_data = btf_show_start_type(show, t, type_id, data); if (!safe_data) return safe_data; btf_show(show, "%s%s%s", btf_show_indent(show), btf_show_name(show), btf_show_newline(show)); show->state.depth++; return safe_data; } static void btf_show_end_aggr_type(struct btf_show *show, const char *suffix) { show->state.depth--; btf_show(show, "%s%s%s%s", btf_show_indent(show), suffix, btf_show_delim(show), btf_show_newline(show)); btf_show_end_type(show); } static void btf_show_start_member(struct btf_show *show, const struct btf_member *m) { show->state.member = m; } static void btf_show_start_array_member(struct btf_show *show) { show->state.array_member = 1; btf_show_start_member(show, NULL); } static void btf_show_end_member(struct btf_show *show) { show->state.member = NULL; } static void btf_show_end_array_member(struct btf_show *show) { show->state.array_member = 0; btf_show_end_member(show); } static void *btf_show_start_array_type(struct btf_show *show, const struct btf_type *t, u32 type_id, u16 array_encoding, void *data) { show->state.array_encoding = array_encoding; show->state.array_terminated = 0; return btf_show_start_aggr_type(show, t, type_id, data); } static void btf_show_end_array_type(struct btf_show *show) { show->state.array_encoding = 0; show->state.array_terminated = 0; btf_show_end_aggr_type(show, "]"); } static void *btf_show_start_struct_type(struct btf_show *show, const struct btf_type *t, u32 type_id, void *data) { return btf_show_start_aggr_type(show, t, type_id, data); } static void btf_show_end_struct_type(struct btf_show *show) { btf_show_end_aggr_type(show, "}"); } __printf(2, 3) static void __btf_verifier_log(struct bpf_verifier_log *log, const char *fmt, ...) { va_list args; va_start(args, fmt); bpf_verifier_vlog(log, fmt, args); va_end(args); } __printf(2, 3) static void btf_verifier_log(struct btf_verifier_env *env, const char *fmt, ...) { struct bpf_verifier_log *log = &env->log; va_list args; if (!bpf_verifier_log_needed(log)) return; va_start(args, fmt); bpf_verifier_vlog(log, fmt, args); va_end(args); } __printf(4, 5) static void __btf_verifier_log_type(struct btf_verifier_env *env, const struct btf_type *t, bool log_details, const char *fmt, ...) { struct bpf_verifier_log *log = &env->log; struct btf *btf = env->btf; va_list args; if (!bpf_verifier_log_needed(log)) return; if (log->level == BPF_LOG_KERNEL) { /* btf verifier prints all types it is processing via * btf_verifier_log_type(..., fmt = NULL). * Skip those prints for in-kernel BTF verification. */ if (!fmt) return; /* Skip logging when loading module BTF with mismatches permitted */ if (env->btf->base_btf && IS_ENABLED(CONFIG_MODULE_ALLOW_BTF_MISMATCH)) return; } __btf_verifier_log(log, "[%u] %s %s%s", env->log_type_id, btf_type_str(t), __btf_name_by_offset(btf, t->name_off), log_details ? " " : ""); if (log_details) btf_type_ops(t)->log_details(env, t); if (fmt && *fmt) { __btf_verifier_log(log, " "); va_start(args, fmt); bpf_verifier_vlog(log, fmt, args); va_end(args); } __btf_verifier_log(log, "\n"); } #define btf_verifier_log_type(env, t, ...) \ __btf_verifier_log_type((env), (t), true, __VA_ARGS__) #define btf_verifier_log_basic(env, t, ...) \ __btf_verifier_log_type((env), (t), false, __VA_ARGS__) __printf(4, 5) static void btf_verifier_log_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const char *fmt, ...) { struct bpf_verifier_log *log = &env->log; struct btf *btf = env->btf; va_list args; if (!bpf_verifier_log_needed(log)) return; if (log->level == BPF_LOG_KERNEL) { if (!fmt) return; /* Skip logging when loading module BTF with mismatches permitted */ if (env->btf->base_btf && IS_ENABLED(CONFIG_MODULE_ALLOW_BTF_MISMATCH)) return; } /* The CHECK_META phase already did a btf dump. * * If member is logged again, it must hit an error in * parsing this member. It is useful to print out which * struct this member belongs to. */ if (env->phase != CHECK_META) btf_verifier_log_type(env, struct_type, NULL); if (btf_type_kflag(struct_type)) __btf_verifier_log(log, "\t%s type_id=%u bitfield_size=%u bits_offset=%u", __btf_name_by_offset(btf, member->name_off), member->type, BTF_MEMBER_BITFIELD_SIZE(member->offset), BTF_MEMBER_BIT_OFFSET(member->offset)); else __btf_verifier_log(log, "\t%s type_id=%u bits_offset=%u", __btf_name_by_offset(btf, member->name_off), member->type, member->offset); if (fmt && *fmt) { __btf_verifier_log(log, " "); va_start(args, fmt); bpf_verifier_vlog(log, fmt, args); va_end(args); } __btf_verifier_log(log, "\n"); } __printf(4, 5) static void btf_verifier_log_vsi(struct btf_verifier_env *env, const struct btf_type *datasec_type, const struct btf_var_secinfo *vsi, const char *fmt, ...) { struct bpf_verifier_log *log = &env->log; va_list args; if (!bpf_verifier_log_needed(log)) return; if (log->level == BPF_LOG_KERNEL && !fmt) return; if (env->phase != CHECK_META) btf_verifier_log_type(env, datasec_type, NULL); __btf_verifier_log(log, "\t type_id=%u offset=%u size=%u", vsi->type, vsi->offset, vsi->size); if (fmt && *fmt) { __btf_verifier_log(log, " "); va_start(args, fmt); bpf_verifier_vlog(log, fmt, args); va_end(args); } __btf_verifier_log(log, "\n"); } static void btf_verifier_log_hdr(struct btf_verifier_env *env, u32 btf_data_size) { struct bpf_verifier_log *log = &env->log; const struct btf *btf = env->btf; const struct btf_header *hdr; if (!bpf_verifier_log_needed(log)) return; if (log->level == BPF_LOG_KERNEL) return; hdr = &btf->hdr; __btf_verifier_log(log, "magic: 0x%x\n", hdr->magic); __btf_verifier_log(log, "version: %u\n", hdr->version); __btf_verifier_log(log, "flags: 0x%x\n", hdr->flags); __btf_verifier_log(log, "hdr_len: %u\n", hdr->hdr_len); __btf_verifier_log(log, "type_off: %u\n", hdr->type_off); __btf_verifier_log(log, "type_len: %u\n", hdr->type_len); __btf_verifier_log(log, "str_off: %u\n", hdr->str_off); __btf_verifier_log(log, "str_len: %u\n", hdr->str_len); __btf_verifier_log(log, "btf_total_size: %u\n", btf_data_size); } static int btf_add_type(struct btf_verifier_env *env, struct btf_type *t) { struct btf *btf = env->btf; if (btf->types_size == btf->nr_types) { /* Expand 'types' array */ struct btf_type **new_types; u32 expand_by, new_size; if (btf->start_id + btf->types_size == BTF_MAX_TYPE) { btf_verifier_log(env, "Exceeded max num of types"); return -E2BIG; } expand_by = max_t(u32, btf->types_size >> 2, 16); new_size = min_t(u32, BTF_MAX_TYPE, btf->types_size + expand_by); new_types = kvcalloc(new_size, sizeof(*new_types), GFP_KERNEL | __GFP_NOWARN); if (!new_types) return -ENOMEM; if (btf->nr_types == 0) { if (!btf->base_btf) { /* lazily init VOID type */ new_types[0] = &btf_void; btf->nr_types++; } } else { memcpy(new_types, btf->types, sizeof(*btf->types) * btf->nr_types); } kvfree(btf->types); btf->types = new_types; btf->types_size = new_size; } btf->types[btf->nr_types++] = t; return 0; } static int btf_alloc_id(struct btf *btf) { int id; idr_preload(GFP_KERNEL); spin_lock_bh(&btf_idr_lock); id = idr_alloc_cyclic(&btf_idr, btf, 1, INT_MAX, GFP_ATOMIC); if (id > 0) btf->id = id; spin_unlock_bh(&btf_idr_lock); idr_preload_end(); if (WARN_ON_ONCE(!id)) return -ENOSPC; return id > 0 ? 0 : id; } static void btf_free_id(struct btf *btf) { unsigned long flags; /* * In map-in-map, calling map_delete_elem() on outer * map will call bpf_map_put on the inner map. * It will then eventually call btf_free_id() * on the inner map. Some of the map_delete_elem() * implementation may have irq disabled, so * we need to use the _irqsave() version instead * of the _bh() version. */ spin_lock_irqsave(&btf_idr_lock, flags); idr_remove(&btf_idr, btf->id); spin_unlock_irqrestore(&btf_idr_lock, flags); } static void btf_free_kfunc_set_tab(struct btf *btf) { struct btf_kfunc_set_tab *tab = btf->kfunc_set_tab; int hook; if (!tab) return; for (hook = 0; hook < ARRAY_SIZE(tab->sets); hook++) kfree(tab->sets[hook]); kfree(tab); btf->kfunc_set_tab = NULL; } static void btf_free_dtor_kfunc_tab(struct btf *btf) { struct btf_id_dtor_kfunc_tab *tab = btf->dtor_kfunc_tab; if (!tab) return; kfree(tab); btf->dtor_kfunc_tab = NULL; } static void btf_struct_metas_free(struct btf_struct_metas *tab) { int i; if (!tab) return; for (i = 0; i < tab->cnt; i++) btf_record_free(tab->types[i].record); kfree(tab); } static void btf_free_struct_meta_tab(struct btf *btf) { struct btf_struct_metas *tab = btf->struct_meta_tab; btf_struct_metas_free(tab); btf->struct_meta_tab = NULL; } static void btf_free_struct_ops_tab(struct btf *btf) { struct btf_struct_ops_tab *tab = btf->struct_ops_tab; u32 i; if (!tab) return; for (i = 0; i < tab->cnt; i++) bpf_struct_ops_desc_release(&tab->ops[i]); kfree(tab); btf->struct_ops_tab = NULL; } static void btf_free(struct btf *btf) { btf_free_struct_meta_tab(btf); btf_free_dtor_kfunc_tab(btf); btf_free_kfunc_set_tab(btf); btf_free_struct_ops_tab(btf); kvfree(btf->types); kvfree(btf->resolved_sizes); kvfree(btf->resolved_ids); /* vmlinux does not allocate btf->data, it simply points it at * __start_BTF. */ if (!btf_is_vmlinux(btf)) kvfree(btf->data); kvfree(btf->base_id_map); kfree(btf); } static void btf_free_rcu(struct rcu_head *rcu) { struct btf *btf = container_of(rcu, struct btf, rcu); btf_free(btf); } const char *btf_get_name(const struct btf *btf) { return btf->name; } void btf_get(struct btf *btf) { refcount_inc(&btf->refcnt); } void btf_put(struct btf *btf) { if (btf && refcount_dec_and_test(&btf->refcnt)) { btf_free_id(btf); call_rcu(&btf->rcu, btf_free_rcu); } } struct btf *btf_base_btf(const struct btf *btf) { return btf->base_btf; } const struct btf_header *btf_header(const struct btf *btf) { return &btf->hdr; } void btf_set_base_btf(struct btf *btf, const struct btf *base_btf) { btf->base_btf = (struct btf *)base_btf; btf->start_id = btf_nr_types(base_btf); btf->start_str_off = base_btf->hdr.str_len; } static int env_resolve_init(struct btf_verifier_env *env) { struct btf *btf = env->btf; u32 nr_types = btf->nr_types; u32 *resolved_sizes = NULL; u32 *resolved_ids = NULL; u8 *visit_states = NULL; resolved_sizes = kvcalloc(nr_types, sizeof(*resolved_sizes), GFP_KERNEL | __GFP_NOWARN); if (!resolved_sizes) goto nomem; resolved_ids = kvcalloc(nr_types, sizeof(*resolved_ids), GFP_KERNEL | __GFP_NOWARN); if (!resolved_ids) goto nomem; visit_states = kvcalloc(nr_types, sizeof(*visit_states), GFP_KERNEL | __GFP_NOWARN); if (!visit_states) goto nomem; btf->resolved_sizes = resolved_sizes; btf->resolved_ids = resolved_ids; env->visit_states = visit_states; return 0; nomem: kvfree(resolved_sizes); kvfree(resolved_ids); kvfree(visit_states); return -ENOMEM; } static void btf_verifier_env_free(struct btf_verifier_env *env) { kvfree(env->visit_states); kfree(env); } static bool env_type_is_resolve_sink(const struct btf_verifier_env *env, const struct btf_type *next_type) { switch (env->resolve_mode) { case RESOLVE_TBD: /* int, enum or void is a sink */ return !btf_type_needs_resolve(next_type); case RESOLVE_PTR: /* int, enum, void, struct, array, func or func_proto is a sink * for ptr */ return !btf_type_is_modifier(next_type) && !btf_type_is_ptr(next_type); case RESOLVE_STRUCT_OR_ARRAY: /* int, enum, void, ptr, func or func_proto is a sink * for struct and array */ return !btf_type_is_modifier(next_type) && !btf_type_is_array(next_type) && !btf_type_is_struct(next_type); default: BUG(); } } static bool env_type_is_resolved(const struct btf_verifier_env *env, u32 type_id) { /* base BTF types should be resolved by now */ if (type_id < env->btf->start_id) return true; return env->visit_states[type_id - env->btf->start_id] == RESOLVED; } static int env_stack_push(struct btf_verifier_env *env, const struct btf_type *t, u32 type_id) { const struct btf *btf = env->btf; struct resolve_vertex *v; if (env->top_stack == MAX_RESOLVE_DEPTH) return -E2BIG; if (type_id < btf->start_id || env->visit_states[type_id - btf->start_id] != NOT_VISITED) return -EEXIST; env->visit_states[type_id - btf->start_id] = VISITED; v = &env->stack[env->top_stack++]; v->t = t; v->type_id = type_id; v->next_member = 0; if (env->resolve_mode == RESOLVE_TBD) { if (btf_type_is_ptr(t)) env->resolve_mode = RESOLVE_PTR; else if (btf_type_is_struct(t) || btf_type_is_array(t)) env->resolve_mode = RESOLVE_STRUCT_OR_ARRAY; } return 0; } static void env_stack_set_next_member(struct btf_verifier_env *env, u16 next_member) { env->stack[env->top_stack - 1].next_member = next_member; } static void env_stack_pop_resolved(struct btf_verifier_env *env, u32 resolved_type_id, u32 resolved_size) { u32 type_id = env->stack[--(env->top_stack)].type_id; struct btf *btf = env->btf; type_id -= btf->start_id; /* adjust to local type id */ btf->resolved_sizes[type_id] = resolved_size; btf->resolved_ids[type_id] = resolved_type_id; env->visit_states[type_id] = RESOLVED; } static const struct resolve_vertex *env_stack_peak(struct btf_verifier_env *env) { return env->top_stack ? &env->stack[env->top_stack - 1] : NULL; } /* Resolve the size of a passed-in "type" * * type: is an array (e.g. u32 array[x][y]) * return type: type "u32[x][y]", i.e. BTF_KIND_ARRAY, * *type_size: (x * y * sizeof(u32)). Hence, *type_size always * corresponds to the return type. * *elem_type: u32 * *elem_id: id of u32 * *total_nelems: (x * y). Hence, individual elem size is * (*type_size / *total_nelems) * *type_id: id of type if it's changed within the function, 0 if not * * type: is not an array (e.g. const struct X) * return type: type "struct X" * *type_size: sizeof(struct X) * *elem_type: same as return type ("struct X") * *elem_id: 0 * *total_nelems: 1 * *type_id: id of type if it's changed within the function, 0 if not */ static const struct btf_type * __btf_resolve_size(const struct btf *btf, const struct btf_type *type, u32 *type_size, const struct btf_type **elem_type, u32 *elem_id, u32 *total_nelems, u32 *type_id) { const struct btf_type *array_type = NULL; const struct btf_array *array = NULL; u32 i, size, nelems = 1, id = 0; for (i = 0; i < MAX_RESOLVE_DEPTH; i++) { switch (BTF_INFO_KIND(type->info)) { /* type->size can be used */ case BTF_KIND_INT: case BTF_KIND_STRUCT: case BTF_KIND_UNION: case BTF_KIND_ENUM: case BTF_KIND_FLOAT: case BTF_KIND_ENUM64: size = type->size; goto resolved; case BTF_KIND_PTR: size = sizeof(void *); goto resolved; /* Modifiers */ case BTF_KIND_TYPEDEF: case BTF_KIND_VOLATILE: case BTF_KIND_CONST: case BTF_KIND_RESTRICT: case BTF_KIND_TYPE_TAG: id = type->type; type = btf_type_by_id(btf, type->type); break; case BTF_KIND_ARRAY: if (!array_type) array_type = type; array = btf_type_array(type); if (nelems && array->nelems > U32_MAX / nelems) return ERR_PTR(-EINVAL); nelems *= array->nelems; type = btf_type_by_id(btf, array->type); break; /* type without size */ default: return ERR_PTR(-EINVAL); } } return ERR_PTR(-EINVAL); resolved: if (nelems && size > U32_MAX / nelems) return ERR_PTR(-EINVAL); *type_size = nelems * size; if (total_nelems) *total_nelems = nelems; if (elem_type) *elem_type = type; if (elem_id) *elem_id = array ? array->type : 0; if (type_id && id) *type_id = id; return array_type ? : type; } const struct btf_type * btf_resolve_size(const struct btf *btf, const struct btf_type *type, u32 *type_size) { return __btf_resolve_size(btf, type, type_size, NULL, NULL, NULL, NULL); } static u32 btf_resolved_type_id(const struct btf *btf, u32 type_id) { while (type_id < btf->start_id) btf = btf->base_btf; return btf->resolved_ids[type_id - btf->start_id]; } /* The input param "type_id" must point to a needs_resolve type */ static const struct btf_type *btf_type_id_resolve(const struct btf *btf, u32 *type_id) { *type_id = btf_resolved_type_id(btf, *type_id); return btf_type_by_id(btf, *type_id); } static u32 btf_resolved_type_size(const struct btf *btf, u32 type_id) { while (type_id < btf->start_id) btf = btf->base_btf; return btf->resolved_sizes[type_id - btf->start_id]; } const struct btf_type *btf_type_id_size(const struct btf *btf, u32 *type_id, u32 *ret_size) { const struct btf_type *size_type; u32 size_type_id = *type_id; u32 size = 0; size_type = btf_type_by_id(btf, size_type_id); if (btf_type_nosize_or_null(size_type)) return NULL; if (btf_type_has_size(size_type)) { size = size_type->size; } else if (btf_type_is_array(size_type)) { size = btf_resolved_type_size(btf, size_type_id); } else if (btf_type_is_ptr(size_type)) { size = sizeof(void *); } else { if (WARN_ON_ONCE(!btf_type_is_modifier(size_type) && !btf_type_is_var(size_type))) return NULL; size_type_id = btf_resolved_type_id(btf, size_type_id); size_type = btf_type_by_id(btf, size_type_id); if (btf_type_nosize_or_null(size_type)) return NULL; else if (btf_type_has_size(size_type)) size = size_type->size; else if (btf_type_is_array(size_type)) size = btf_resolved_type_size(btf, size_type_id); else if (btf_type_is_ptr(size_type)) size = sizeof(void *); else return NULL; } *type_id = size_type_id; if (ret_size) *ret_size = size; return size_type; } static int btf_df_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { btf_verifier_log_basic(env, struct_type, "Unsupported check_member"); return -EINVAL; } static int btf_df_check_kflag_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { btf_verifier_log_basic(env, struct_type, "Unsupported check_kflag_member"); return -EINVAL; } /* Used for ptr, array struct/union and float type members. * int, enum and modifier types have their specific callback functions. */ static int btf_generic_check_kflag_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { if (BTF_MEMBER_BITFIELD_SIZE(member->offset)) { btf_verifier_log_member(env, struct_type, member, "Invalid member bitfield_size"); return -EINVAL; } /* bitfield size is 0, so member->offset represents bit offset only. * It is safe to call non kflag check_member variants. */ return btf_type_ops(member_type)->check_member(env, struct_type, member, member_type); } static int btf_df_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { btf_verifier_log_basic(env, v->t, "Unsupported resolve"); return -EINVAL; } static void btf_df_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offsets, struct btf_show *show) { btf_show(show, "<unsupported kind:%u>", BTF_INFO_KIND(t->info)); } static int btf_int_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u32 int_data = btf_type_int(member_type); u32 struct_bits_off = member->offset; u32 struct_size = struct_type->size; u32 nr_copy_bits; u32 bytes_offset; if (U32_MAX - struct_bits_off < BTF_INT_OFFSET(int_data)) { btf_verifier_log_member(env, struct_type, member, "bits_offset exceeds U32_MAX"); return -EINVAL; } struct_bits_off += BTF_INT_OFFSET(int_data); bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); nr_copy_bits = BTF_INT_BITS(int_data) + BITS_PER_BYTE_MASKED(struct_bits_off); if (nr_copy_bits > BITS_PER_U128) { btf_verifier_log_member(env, struct_type, member, "nr_copy_bits exceeds 128"); return -EINVAL; } if (struct_size < bytes_offset || struct_size - bytes_offset < BITS_ROUNDUP_BYTES(nr_copy_bits)) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static int btf_int_check_kflag_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u32 struct_bits_off, nr_bits, nr_int_data_bits, bytes_offset; u32 int_data = btf_type_int(member_type); u32 struct_size = struct_type->size; u32 nr_copy_bits; /* a regular int type is required for the kflag int member */ if (!btf_type_int_is_regular(member_type)) { btf_verifier_log_member(env, struct_type, member, "Invalid member base type"); return -EINVAL; } /* check sanity of bitfield size */ nr_bits = BTF_MEMBER_BITFIELD_SIZE(member->offset); struct_bits_off = BTF_MEMBER_BIT_OFFSET(member->offset); nr_int_data_bits = BTF_INT_BITS(int_data); if (!nr_bits) { /* Not a bitfield member, member offset must be at byte * boundary. */ if (BITS_PER_BYTE_MASKED(struct_bits_off)) { btf_verifier_log_member(env, struct_type, member, "Invalid member offset"); return -EINVAL; } nr_bits = nr_int_data_bits; } else if (nr_bits > nr_int_data_bits) { btf_verifier_log_member(env, struct_type, member, "Invalid member bitfield_size"); return -EINVAL; } bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); nr_copy_bits = nr_bits + BITS_PER_BYTE_MASKED(struct_bits_off); if (nr_copy_bits > BITS_PER_U128) { btf_verifier_log_member(env, struct_type, member, "nr_copy_bits exceeds 128"); return -EINVAL; } if (struct_size < bytes_offset || struct_size - bytes_offset < BITS_ROUNDUP_BYTES(nr_copy_bits)) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static s32 btf_int_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { u32 int_data, nr_bits, meta_needed = sizeof(int_data); u16 encoding; if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; } if (btf_type_kflag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } int_data = btf_type_int(t); if (int_data & ~BTF_INT_MASK) { btf_verifier_log_basic(env, t, "Invalid int_data:%x", int_data); return -EINVAL; } nr_bits = BTF_INT_BITS(int_data) + BTF_INT_OFFSET(int_data); if (nr_bits > BITS_PER_U128) { btf_verifier_log_type(env, t, "nr_bits exceeds %zu", BITS_PER_U128); return -EINVAL; } if (BITS_ROUNDUP_BYTES(nr_bits) > t->size) { btf_verifier_log_type(env, t, "nr_bits exceeds type_size"); return -EINVAL; } /* * Only one of the encoding bits is allowed and it * should be sufficient for the pretty print purpose (i.e. decoding). * Multiple bits can be allowed later if it is found * to be insufficient. */ encoding = BTF_INT_ENCODING(int_data); if (encoding && encoding != BTF_INT_SIGNED && encoding != BTF_INT_CHAR && encoding != BTF_INT_BOOL) { btf_verifier_log_type(env, t, "Unsupported encoding"); return -ENOTSUPP; } btf_verifier_log_type(env, t, NULL); return meta_needed; } static void btf_int_log(struct btf_verifier_env *env, const struct btf_type *t) { int int_data = btf_type_int(t); btf_verifier_log(env, "size=%u bits_offset=%u nr_bits=%u encoding=%s", t->size, BTF_INT_OFFSET(int_data), BTF_INT_BITS(int_data), btf_int_encoding_str(BTF_INT_ENCODING(int_data))); } static void btf_int128_print(struct btf_show *show, void *data) { /* data points to a __int128 number. * Suppose * int128_num = *(__int128 *)data; * The below formulas shows what upper_num and lower_num represents: * upper_num = int128_num >> 64; * lower_num = int128_num & 0xffffffffFFFFFFFFULL; */ u64 upper_num, lower_num; #ifdef __BIG_ENDIAN_BITFIELD upper_num = *(u64 *)data; lower_num = *(u64 *)(data + 8); #else upper_num = *(u64 *)(data + 8); lower_num = *(u64 *)data; #endif if (upper_num == 0) btf_show_type_value(show, "0x%llx", lower_num); else btf_show_type_values(show, "0x%llx%016llx", upper_num, lower_num); } static void btf_int128_shift(u64 *print_num, u16 left_shift_bits, u16 right_shift_bits) { u64 upper_num, lower_num; #ifdef __BIG_ENDIAN_BITFIELD upper_num = print_num[0]; lower_num = print_num[1]; #else upper_num = print_num[1]; lower_num = print_num[0]; #endif /* shake out un-needed bits by shift/or operations */ if (left_shift_bits >= 64) { upper_num = lower_num << (left_shift_bits - 64); lower_num = 0; } else { upper_num = (upper_num << left_shift_bits) | (lower_num >> (64 - left_shift_bits)); lower_num = lower_num << left_shift_bits; } if (right_shift_bits >= 64) { lower_num = upper_num >> (right_shift_bits - 64); upper_num = 0; } else { lower_num = (lower_num >> right_shift_bits) | (upper_num << (64 - right_shift_bits)); upper_num = upper_num >> right_shift_bits; } #ifdef __BIG_ENDIAN_BITFIELD print_num[0] = upper_num; print_num[1] = lower_num; #else print_num[0] = lower_num; print_num[1] = upper_num; #endif } static void btf_bitfield_show(void *data, u8 bits_offset, u8 nr_bits, struct btf_show *show) { u16 left_shift_bits, right_shift_bits; u8 nr_copy_bytes; u8 nr_copy_bits; u64 print_num[2] = {}; nr_copy_bits = nr_bits + bits_offset; nr_copy_bytes = BITS_ROUNDUP_BYTES(nr_copy_bits); memcpy(print_num, data, nr_copy_bytes); #ifdef __BIG_ENDIAN_BITFIELD left_shift_bits = bits_offset; #else left_shift_bits = BITS_PER_U128 - nr_copy_bits; #endif right_shift_bits = BITS_PER_U128 - nr_bits; btf_int128_shift(print_num, left_shift_bits, right_shift_bits); btf_int128_print(show, print_num); } static void btf_int_bits_show(const struct btf *btf, const struct btf_type *t, void *data, u8 bits_offset, struct btf_show *show) { u32 int_data = btf_type_int(t); u8 nr_bits = BTF_INT_BITS(int_data); u8 total_bits_offset; /* * bits_offset is at most 7. * BTF_INT_OFFSET() cannot exceed 128 bits. */ total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data); data += BITS_ROUNDDOWN_BYTES(total_bits_offset); bits_offset = BITS_PER_BYTE_MASKED(total_bits_offset); btf_bitfield_show(data, bits_offset, nr_bits, show); } static void btf_int_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { u32 int_data = btf_type_int(t); u8 encoding = BTF_INT_ENCODING(int_data); bool sign = encoding & BTF_INT_SIGNED; u8 nr_bits = BTF_INT_BITS(int_data); void *safe_data; safe_data = btf_show_start_type(show, t, type_id, data); if (!safe_data) return; if (bits_offset || BTF_INT_OFFSET(int_data) || BITS_PER_BYTE_MASKED(nr_bits)) { btf_int_bits_show(btf, t, safe_data, bits_offset, show); goto out; } switch (nr_bits) { case 128: btf_int128_print(show, safe_data); break; case 64: if (sign) btf_show_type_value(show, "%lld", *(s64 *)safe_data); else btf_show_type_value(show, "%llu", *(u64 *)safe_data); break; case 32: if (sign) btf_show_type_value(show, "%d", *(s32 *)safe_data); else btf_show_type_value(show, "%u", *(u32 *)safe_data); break; case 16: if (sign) btf_show_type_value(show, "%d", *(s16 *)safe_data); else btf_show_type_value(show, "%u", *(u16 *)safe_data); break; case 8: if (show->state.array_encoding == BTF_INT_CHAR) { /* check for null terminator */ if (show->state.array_terminated) break; if (*(char *)data == '\0') { show->state.array_terminated = 1; break; } if (isprint(*(char *)data)) { btf_show_type_value(show, "'%c'", *(char *)safe_data); break; } } if (sign) btf_show_type_value(show, "%d", *(s8 *)safe_data); else btf_show_type_value(show, "%u", *(u8 *)safe_data); break; default: btf_int_bits_show(btf, t, safe_data, bits_offset, show); break; } out: btf_show_end_type(show); } static const struct btf_kind_operations int_ops = { .check_meta = btf_int_check_meta, .resolve = btf_df_resolve, .check_member = btf_int_check_member, .check_kflag_member = btf_int_check_kflag_member, .log_details = btf_int_log, .show = btf_int_show, }; static int btf_modifier_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { const struct btf_type *resolved_type; u32 resolved_type_id = member->type; struct btf_member resolved_member; struct btf *btf = env->btf; resolved_type = btf_type_id_size(btf, &resolved_type_id, NULL); if (!resolved_type) { btf_verifier_log_member(env, struct_type, member, "Invalid member"); return -EINVAL; } resolved_member = *member; resolved_member.type = resolved_type_id; return btf_type_ops(resolved_type)->check_member(env, struct_type, &resolved_member, resolved_type); } static int btf_modifier_check_kflag_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { const struct btf_type *resolved_type; u32 resolved_type_id = member->type; struct btf_member resolved_member; struct btf *btf = env->btf; resolved_type = btf_type_id_size(btf, &resolved_type_id, NULL); if (!resolved_type) { btf_verifier_log_member(env, struct_type, member, "Invalid member"); return -EINVAL; } resolved_member = *member; resolved_member.type = resolved_type_id; return btf_type_ops(resolved_type)->check_kflag_member(env, struct_type, &resolved_member, resolved_type); } static int btf_ptr_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u32 struct_size, struct_bits_off, bytes_offset; struct_size = struct_type->size; struct_bits_off = member->offset; bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); if (BITS_PER_BYTE_MASKED(struct_bits_off)) { btf_verifier_log_member(env, struct_type, member, "Member is not byte aligned"); return -EINVAL; } if (struct_size - bytes_offset < sizeof(void *)) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static int btf_ref_type_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { const char *value; if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; } if (btf_type_kflag(t) && !btf_type_is_type_tag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } if (!BTF_TYPE_ID_VALID(t->type)) { btf_verifier_log_type(env, t, "Invalid type_id"); return -EINVAL; } /* typedef/type_tag type must have a valid name, and other ref types, * volatile, const, restrict, should have a null name. */ if (BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF) { if (!t->name_off || !btf_name_valid_identifier(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } } else if (BTF_INFO_KIND(t->info) == BTF_KIND_TYPE_TAG) { value = btf_name_by_offset(env->btf, t->name_off); if (!value || !value[0]) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } } else { if (t->name_off) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } } btf_verifier_log_type(env, t, NULL); return 0; } static int btf_modifier_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_type *t = v->t; const struct btf_type *next_type; u32 next_type_id = t->type; struct btf *btf = env->btf; next_type = btf_type_by_id(btf, next_type_id); if (!next_type || btf_type_is_resolve_source_only(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } if (!env_type_is_resolve_sink(env, next_type) && !env_type_is_resolved(env, next_type_id)) return env_stack_push(env, next_type, next_type_id); /* Figure out the resolved next_type_id with size. * They will be stored in the current modifier's * resolved_ids and resolved_sizes such that it can * save us a few type-following when we use it later (e.g. in * pretty print). */ if (!btf_type_id_size(btf, &next_type_id, NULL)) { if (env_type_is_resolved(env, next_type_id)) next_type = btf_type_id_resolve(btf, &next_type_id); /* "typedef void new_void", "const void"...etc */ if (!btf_type_is_void(next_type) && !btf_type_is_fwd(next_type) && !btf_type_is_func_proto(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } } env_stack_pop_resolved(env, next_type_id, 0); return 0; } static int btf_var_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_type *next_type; const struct btf_type *t = v->t; u32 next_type_id = t->type; struct btf *btf = env->btf; next_type = btf_type_by_id(btf, next_type_id); if (!next_type || btf_type_is_resolve_source_only(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } if (!env_type_is_resolve_sink(env, next_type) && !env_type_is_resolved(env, next_type_id)) return env_stack_push(env, next_type, next_type_id); if (btf_type_is_modifier(next_type)) { const struct btf_type *resolved_type; u32 resolved_type_id; resolved_type_id = next_type_id; resolved_type = btf_type_id_resolve(btf, &resolved_type_id); if (btf_type_is_ptr(resolved_type) && !env_type_is_resolve_sink(env, resolved_type) && !env_type_is_resolved(env, resolved_type_id)) return env_stack_push(env, resolved_type, resolved_type_id); } /* We must resolve to something concrete at this point, no * forward types or similar that would resolve to size of * zero is allowed. */ if (!btf_type_id_size(btf, &next_type_id, NULL)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } env_stack_pop_resolved(env, next_type_id, 0); return 0; } static int btf_ptr_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_type *next_type; const struct btf_type *t = v->t; u32 next_type_id = t->type; struct btf *btf = env->btf; next_type = btf_type_by_id(btf, next_type_id); if (!next_type || btf_type_is_resolve_source_only(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } if (!env_type_is_resolve_sink(env, next_type) && !env_type_is_resolved(env, next_type_id)) return env_stack_push(env, next_type, next_type_id); /* If the modifier was RESOLVED during RESOLVE_STRUCT_OR_ARRAY, * the modifier may have stopped resolving when it was resolved * to a ptr (last-resolved-ptr). * * We now need to continue from the last-resolved-ptr to * ensure the last-resolved-ptr will not referring back to * the current ptr (t). */ if (btf_type_is_modifier(next_type)) { const struct btf_type *resolved_type; u32 resolved_type_id; resolved_type_id = next_type_id; resolved_type = btf_type_id_resolve(btf, &resolved_type_id); if (btf_type_is_ptr(resolved_type) && !env_type_is_resolve_sink(env, resolved_type) && !env_type_is_resolved(env, resolved_type_id)) return env_stack_push(env, resolved_type, resolved_type_id); } if (!btf_type_id_size(btf, &next_type_id, NULL)) { if (env_type_is_resolved(env, next_type_id)) next_type = btf_type_id_resolve(btf, &next_type_id); if (!btf_type_is_void(next_type) && !btf_type_is_fwd(next_type) && !btf_type_is_func_proto(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } } env_stack_pop_resolved(env, next_type_id, 0); return 0; } static void btf_modifier_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { if (btf->resolved_ids) t = btf_type_id_resolve(btf, &type_id); else t = btf_type_skip_modifiers(btf, type_id, NULL); btf_type_ops(t)->show(btf, t, type_id, data, bits_offset, show); } static void btf_var_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { t = btf_type_id_resolve(btf, &type_id); btf_type_ops(t)->show(btf, t, type_id, data, bits_offset, show); } static void btf_ptr_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { void *safe_data; safe_data = btf_show_start_type(show, t, type_id, data); if (!safe_data) return; /* It is a hashed value unless BTF_SHOW_PTR_RAW is specified */ if (show->flags & BTF_SHOW_PTR_RAW) btf_show_type_value(show, "0x%px", *(void **)safe_data); else btf_show_type_value(show, "0x%p", *(void **)safe_data); btf_show_end_type(show); } static void btf_ref_type_log(struct btf_verifier_env *env, const struct btf_type *t) { btf_verifier_log(env, "type_id=%u", t->type); } static const struct btf_kind_operations modifier_ops = { .check_meta = btf_ref_type_check_meta, .resolve = btf_modifier_resolve, .check_member = btf_modifier_check_member, .check_kflag_member = btf_modifier_check_kflag_member, .log_details = btf_ref_type_log, .show = btf_modifier_show, }; static const struct btf_kind_operations ptr_ops = { .check_meta = btf_ref_type_check_meta, .resolve = btf_ptr_resolve, .check_member = btf_ptr_check_member, .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_ref_type_log, .show = btf_ptr_show, }; static s32 btf_fwd_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; } if (t->type) { btf_verifier_log_type(env, t, "type != 0"); return -EINVAL; } /* fwd type must have a valid name */ if (!t->name_off || !btf_name_valid_identifier(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); return 0; } static void btf_fwd_type_log(struct btf_verifier_env *env, const struct btf_type *t) { btf_verifier_log(env, "%s", btf_type_kflag(t) ? "union" : "struct"); } static const struct btf_kind_operations fwd_ops = { .check_meta = btf_fwd_check_meta, .resolve = btf_df_resolve, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_fwd_type_log, .show = btf_df_show, }; static int btf_array_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u32 struct_bits_off = member->offset; u32 struct_size, bytes_offset; u32 array_type_id, array_size; struct btf *btf = env->btf; if (BITS_PER_BYTE_MASKED(struct_bits_off)) { btf_verifier_log_member(env, struct_type, member, "Member is not byte aligned"); return -EINVAL; } array_type_id = member->type; btf_type_id_size(btf, &array_type_id, &array_size); struct_size = struct_type->size; bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); if (struct_size - bytes_offset < array_size) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static s32 btf_array_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { const struct btf_array *array = btf_type_array(t); u32 meta_needed = sizeof(*array); if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } /* array type should not have a name */ if (t->name_off) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; } if (btf_type_kflag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } if (t->size) { btf_verifier_log_type(env, t, "size != 0"); return -EINVAL; } /* Array elem type and index type cannot be in type void, * so !array->type and !array->index_type are not allowed. */ if (!array->type || !BTF_TYPE_ID_VALID(array->type)) { btf_verifier_log_type(env, t, "Invalid elem"); return -EINVAL; } if (!array->index_type || !BTF_TYPE_ID_VALID(array->index_type)) { btf_verifier_log_type(env, t, "Invalid index"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); return meta_needed; } static int btf_array_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_array *array = btf_type_array(v->t); const struct btf_type *elem_type, *index_type; u32 elem_type_id, index_type_id; struct btf *btf = env->btf; u32 elem_size; /* Check array->index_type */ index_type_id = array->index_type; index_type = btf_type_by_id(btf, index_type_id); if (btf_type_nosize_or_null(index_type) || btf_type_is_resolve_source_only(index_type)) { btf_verifier_log_type(env, v->t, "Invalid index"); return -EINVAL; } if (!env_type_is_resolve_sink(env, index_type) && !env_type_is_resolved(env, index_type_id)) return env_stack_push(env, index_type, index_type_id); index_type = btf_type_id_size(btf, &index_type_id, NULL); if (!index_type || !btf_type_is_int(index_type) || !btf_type_int_is_regular(index_type)) { btf_verifier_log_type(env, v->t, "Invalid index"); return -EINVAL; } /* Check array->type */ elem_type_id = array->type; elem_type = btf_type_by_id(btf, elem_type_id); if (btf_type_nosize_or_null(elem_type) || btf_type_is_resolve_source_only(elem_type)) { btf_verifier_log_type(env, v->t, "Invalid elem"); return -EINVAL; } if (!env_type_is_resolve_sink(env, elem_type) && !env_type_is_resolved(env, elem_type_id)) return env_stack_push(env, elem_type, elem_type_id); elem_type = btf_type_id_size(btf, &elem_type_id, &elem_size); if (!elem_type) { btf_verifier_log_type(env, v->t, "Invalid elem"); return -EINVAL; } if (btf_type_is_int(elem_type) && !btf_type_int_is_regular(elem_type)) { btf_verifier_log_type(env, v->t, "Invalid array of int"); return -EINVAL; } if (array->nelems && elem_size > U32_MAX / array->nelems) { btf_verifier_log_type(env, v->t, "Array size overflows U32_MAX"); return -EINVAL; } env_stack_pop_resolved(env, elem_type_id, elem_size * array->nelems); return 0; } static void btf_array_log(struct btf_verifier_env *env, const struct btf_type *t) { const struct btf_array *array = btf_type_array(t); btf_verifier_log(env, "type_id=%u index_type_id=%u nr_elems=%u", array->type, array->index_type, array->nelems); } static void __btf_array_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { const struct btf_array *array = btf_type_array(t); const struct btf_kind_operations *elem_ops; const struct btf_type *elem_type; u32 i, elem_size = 0, elem_type_id; u16 encoding = 0; elem_type_id = array->type; elem_type = btf_type_skip_modifiers(btf, elem_type_id, NULL); if (elem_type && btf_type_has_size(elem_type)) elem_size = elem_type->size; if (elem_type && btf_type_is_int(elem_type)) { u32 int_type = btf_type_int(elem_type); encoding = BTF_INT_ENCODING(int_type); /* * BTF_INT_CHAR encoding never seems to be set for * char arrays, so if size is 1 and element is * printable as a char, we'll do that. */ if (elem_size == 1) encoding = BTF_INT_CHAR; } if (!btf_show_start_array_type(show, t, type_id, encoding, data)) return; if (!elem_type) goto out; elem_ops = btf_type_ops(elem_type); for (i = 0; i < array->nelems; i++) { btf_show_start_array_member(show); elem_ops->show(btf, elem_type, elem_type_id, data, bits_offset, show); data += elem_size; btf_show_end_array_member(show); if (show->state.array_terminated) break; } out: btf_show_end_array_type(show); } static void btf_array_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { const struct btf_member *m = show->state.member; /* * First check if any members would be shown (are non-zero). * See comments above "struct btf_show" definition for more * details on how this works at a high-level. */ if (show->state.depth > 0 && !(show->flags & BTF_SHOW_ZERO)) { if (!show->state.depth_check) { show->state.depth_check = show->state.depth + 1; show->state.depth_to_show = 0; } __btf_array_show(btf, t, type_id, data, bits_offset, show); show->state.member = m; if (show->state.depth_check != show->state.depth + 1) return; show->state.depth_check = 0; if (show->state.depth_to_show <= show->state.depth) return; /* * Reaching here indicates we have recursed and found * non-zero array member(s). */ } __btf_array_show(btf, t, type_id, data, bits_offset, show); } static const struct btf_kind_operations array_ops = { .check_meta = btf_array_check_meta, .resolve = btf_array_resolve, .check_member = btf_array_check_member, .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_array_log, .show = btf_array_show, }; static int btf_struct_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u32 struct_bits_off = member->offset; u32 struct_size, bytes_offset; if (BITS_PER_BYTE_MASKED(struct_bits_off)) { btf_verifier_log_member(env, struct_type, member, "Member is not byte aligned"); return -EINVAL; } struct_size = struct_type->size; bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); if (struct_size - bytes_offset < member_type->size) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static s32 btf_struct_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { bool is_union = BTF_INFO_KIND(t->info) == BTF_KIND_UNION; const struct btf_member *member; u32 meta_needed, last_offset; struct btf *btf = env->btf; u32 struct_size = t->size; u32 offset; u16 i; meta_needed = btf_type_vlen(t) * sizeof(*member); if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } /* struct type either no name or a valid one */ if (t->name_off && !btf_name_valid_identifier(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); last_offset = 0; for_each_member(i, t, member) { if (!btf_name_offset_valid(btf, member->name_off)) { btf_verifier_log_member(env, t, member, "Invalid member name_offset:%u", member->name_off); return -EINVAL; } /* struct member either no name or a valid one */ if (member->name_off && !btf_name_valid_identifier(btf, member->name_off)) { btf_verifier_log_member(env, t, member, "Invalid name"); return -EINVAL; } /* A member cannot be in type void */ if (!member->type || !BTF_TYPE_ID_VALID(member->type)) { btf_verifier_log_member(env, t, member, "Invalid type_id"); return -EINVAL; } offset = __btf_member_bit_offset(t, member); if (is_union && offset) { btf_verifier_log_member(env, t, member, "Invalid member bits_offset"); return -EINVAL; } /* * ">" instead of ">=" because the last member could be * "char a[0];" */ if (last_offset > offset) { btf_verifier_log_member(env, t, member, "Invalid member bits_offset"); return -EINVAL; } if (BITS_ROUNDUP_BYTES(offset) > struct_size) { btf_verifier_log_member(env, t, member, "Member bits_offset exceeds its struct size"); return -EINVAL; } btf_verifier_log_member(env, t, member, NULL); last_offset = offset; } return meta_needed; } static int btf_struct_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_member *member; int err; u16 i; /* Before continue resolving the next_member, * ensure the last member is indeed resolved to a * type with size info. */ if (v->next_member) { const struct btf_type *last_member_type; const struct btf_member *last_member; u32 last_member_type_id; last_member = btf_type_member(v->t) + v->next_member - 1; last_member_type_id = last_member->type; if (WARN_ON_ONCE(!env_type_is_resolved(env, last_member_type_id))) return -EINVAL; last_member_type = btf_type_by_id(env->btf, last_member_type_id); if (btf_type_kflag(v->t)) err = btf_type_ops(last_member_type)->check_kflag_member(env, v->t, last_member, last_member_type); else err = btf_type_ops(last_member_type)->check_member(env, v->t, last_member, last_member_type); if (err) return err; } for_each_member_from(i, v->next_member, v->t, member) { u32 member_type_id = member->type; const struct btf_type *member_type = btf_type_by_id(env->btf, member_type_id); if (btf_type_nosize_or_null(member_type) || btf_type_is_resolve_source_only(member_type)) { btf_verifier_log_member(env, v->t, member, "Invalid member"); return -EINVAL; } if (!env_type_is_resolve_sink(env, member_type) && !env_type_is_resolved(env, member_type_id)) { env_stack_set_next_member(env, i + 1); return env_stack_push(env, member_type, member_type_id); } if (btf_type_kflag(v->t)) err = btf_type_ops(member_type)->check_kflag_member(env, v->t, member, member_type); else err = btf_type_ops(member_type)->check_member(env, v->t, member, member_type); if (err) return err; } env_stack_pop_resolved(env, 0, 0); return 0; } static void btf_struct_log(struct btf_verifier_env *env, const struct btf_type *t) { btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t)); } enum { BTF_FIELD_IGNORE = 0, BTF_FIELD_FOUND = 1, }; struct btf_field_info { enum btf_field_type type; u32 off; union { struct { u32 type_id; } kptr; struct { const char *node_name; u32 value_btf_id; } graph_root; }; }; static int btf_find_struct(const struct btf *btf, const struct btf_type *t, u32 off, int sz, enum btf_field_type field_type, struct btf_field_info *info) { if (!__btf_type_is_struct(t)) return BTF_FIELD_IGNORE; if (t->size != sz) return BTF_FIELD_IGNORE; info->type = field_type; info->off = off; return BTF_FIELD_FOUND; } static int btf_find_kptr(const struct btf *btf, const struct btf_type *t, u32 off, int sz, struct btf_field_info *info, u32 field_mask) { enum btf_field_type type; const char *tag_value; bool is_type_tag; u32 res_id; /* Permit modifiers on the pointer itself */ if (btf_type_is_volatile(t)) t = btf_type_by_id(btf, t->type); /* For PTR, sz is always == 8 */ if (!btf_type_is_ptr(t)) return BTF_FIELD_IGNORE; t = btf_type_by_id(btf, t->type); is_type_tag = btf_type_is_type_tag(t) && !btf_type_kflag(t); if (!is_type_tag) return BTF_FIELD_IGNORE; /* Reject extra tags */ if (btf_type_is_type_tag(btf_type_by_id(btf, t->type))) return -EINVAL; tag_value = __btf_name_by_offset(btf, t->name_off); if (!strcmp("kptr_untrusted", tag_value)) type = BPF_KPTR_UNREF; else if (!strcmp("kptr", tag_value)) type = BPF_KPTR_REF; else if (!strcmp("percpu_kptr", tag_value)) type = BPF_KPTR_PERCPU; else if (!strcmp("uptr", tag_value)) type = BPF_UPTR; else return -EINVAL; if (!(type & field_mask)) return BTF_FIELD_IGNORE; /* Get the base type */ t = btf_type_skip_modifiers(btf, t->type, &res_id); /* Only pointer to struct is allowed */ if (!__btf_type_is_struct(t)) return -EINVAL; info->type = type; info->off = off; info->kptr.type_id = res_id; return BTF_FIELD_FOUND; } int btf_find_next_decl_tag(const struct btf *btf, const struct btf_type *pt, int comp_idx, const char *tag_key, int last_id) { int len = strlen(tag_key); int i, n; for (i = last_id + 1, n = btf_nr_types(btf); i < n; i++) { const struct btf_type *t = btf_type_by_id(btf, i); if (!btf_type_is_decl_tag(t)) continue; if (pt != btf_type_by_id(btf, t->type)) continue; if (btf_type_decl_tag(t)->component_idx != comp_idx) continue; if (strncmp(__btf_name_by_offset(btf, t->name_off), tag_key, len)) continue; return i; } return -ENOENT; } const char *btf_find_decl_tag_value(const struct btf *btf, const struct btf_type *pt, int comp_idx, const char *tag_key) { const char *value = NULL; const struct btf_type *t; int len, id; id = btf_find_next_decl_tag(btf, pt, comp_idx, tag_key, 0); if (id < 0) return ERR_PTR(id); t = btf_type_by_id(btf, id); len = strlen(tag_key); value = __btf_name_by_offset(btf, t->name_off) + len; /* Prevent duplicate entries for same type */ id = btf_find_next_decl_tag(btf, pt, comp_idx, tag_key, id); if (id >= 0) return ERR_PTR(-EEXIST); return value; } static int btf_find_graph_root(const struct btf *btf, const struct btf_type *pt, const struct btf_type *t, int comp_idx, u32 off, int sz, struct btf_field_info *info, enum btf_field_type head_type) { const char *node_field_name; const char *value_type; s32 id; if (!__btf_type_is_struct(t)) return BTF_FIELD_IGNORE; if (t->size != sz) return BTF_FIELD_IGNORE; value_type = btf_find_decl_tag_value(btf, pt, comp_idx, "contains:"); if (IS_ERR(value_type)) return -EINVAL; node_field_name = strstr(value_type, ":"); if (!node_field_name) return -EINVAL; value_type = kstrndup(value_type, node_field_name - value_type, GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (!value_type) return -ENOMEM; id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT); kfree(value_type); if (id < 0) return id; node_field_name++; if (str_is_empty(node_field_name)) return -EINVAL; info->type = head_type; info->off = off; info->graph_root.value_btf_id = id; info->graph_root.node_name = node_field_name; return BTF_FIELD_FOUND; } static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_type, u32 field_mask, u32 *seen_mask, int *align, int *sz) { const struct { enum btf_field_type type; const char *const name; const bool is_unique; } field_types[] = { { BPF_SPIN_LOCK, "bpf_spin_lock", true }, { BPF_RES_SPIN_LOCK, "bpf_res_spin_lock", true }, { BPF_TIMER, "bpf_timer", true }, { BPF_WORKQUEUE, "bpf_wq", true }, { BPF_TASK_WORK, "bpf_task_work", true }, { BPF_LIST_HEAD, "bpf_list_head", false }, { BPF_LIST_NODE, "bpf_list_node", false }, { BPF_RB_ROOT, "bpf_rb_root", false }, { BPF_RB_NODE, "bpf_rb_node", false }, { BPF_REFCOUNT, "bpf_refcount", false }, }; int type = 0, i; const char *name = __btf_name_by_offset(btf, var_type->name_off); const char *field_type_name; enum btf_field_type field_type; bool is_unique; for (i = 0; i < ARRAY_SIZE(field_types); ++i) { field_type = field_types[i].type; field_type_name = field_types[i].name; is_unique = field_types[i].is_unique; if (!(field_mask & field_type) || strcmp(name, field_type_name)) continue; if (is_unique) { if (*seen_mask & field_type) return -E2BIG; *seen_mask |= field_type; } type = field_type; goto end; } /* Only return BPF_KPTR when all other types with matchable names fail */ if (field_mask & (BPF_KPTR | BPF_UPTR) && !__btf_type_is_struct(var_type)) { type = BPF_KPTR_REF; goto end; } return 0; end: *sz = btf_field_type_size(type); *align = btf_field_type_align(type); return type; } /* Repeat a number of fields for a specified number of times. * * Copy the fields starting from the first field and repeat them for * repeat_cnt times. The fields are repeated by adding the offset of each * field with * (i + 1) * elem_size * where i is the repeat index and elem_size is the size of an element. */ static int btf_repeat_fields(struct btf_field_info *info, int info_cnt, u32 field_cnt, u32 repeat_cnt, u32 elem_size) { u32 i, j; u32 cur; /* Ensure not repeating fields that should not be repeated. */ for (i = 0; i < field_cnt; i++) { switch (info[i].type) { case BPF_KPTR_UNREF: case BPF_KPTR_REF: case BPF_KPTR_PERCPU: case BPF_UPTR: case BPF_LIST_HEAD: case BPF_RB_ROOT: break; default: return -EINVAL; } } /* The type of struct size or variable size is u32, * so the multiplication will not overflow. */ if (field_cnt * (repeat_cnt + 1) > info_cnt) return -E2BIG; cur = field_cnt; for (i = 0; i < repeat_cnt; i++) { memcpy(&info[cur], &info[0], field_cnt * sizeof(info[0])); for (j = 0; j < field_cnt; j++) info[cur++].off += (i + 1) * elem_size; } return 0; } static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t, u32 field_mask, struct btf_field_info *info, int info_cnt, u32 level); /* Find special fields in the struct type of a field. * * This function is used to find fields of special types that is not a * global variable or a direct field of a struct type. It also handles the * repetition if it is the element type of an array. */ static int btf_find_nested_struct(const struct btf *btf, const struct btf_type *t, u32 off, u32 nelems, u32 field_mask, struct btf_field_info *info, int info_cnt, u32 level) { int ret, err, i; level++; if (level >= MAX_RESOLVE_DEPTH) return -E2BIG; ret = btf_find_struct_field(btf, t, field_mask, info, info_cnt, level); if (ret <= 0) return ret; /* Shift the offsets of the nested struct fields to the offsets * related to the container. */ for (i = 0; i < ret; i++) info[i].off += off; if (nelems > 1) { err = btf_repeat_fields(info, info_cnt, ret, nelems - 1, t->size); if (err == 0) ret *= nelems; else ret = err; } return ret; } static int btf_find_field_one(const struct btf *btf, const struct btf_type *var, const struct btf_type *var_type, int var_idx, u32 off, u32 expected_size, u32 field_mask, u32 *seen_mask, struct btf_field_info *info, int info_cnt, u32 level) { int ret, align, sz, field_type; struct btf_field_info tmp; const struct btf_array *array; u32 i, nelems = 1; /* Walk into array types to find the element type and the number of * elements in the (flattened) array. */ for (i = 0; i < MAX_RESOLVE_DEPTH && btf_type_is_array(var_type); i++) { array = btf_array(var_type); nelems *= array->nelems; var_type = btf_type_by_id(btf, array->type); } if (i == MAX_RESOLVE_DEPTH) return -E2BIG; if (nelems == 0) return 0; field_type = btf_get_field_type(btf, var_type, field_mask, seen_mask, &align, &sz); /* Look into variables of struct types */ if (!field_type && __btf_type_is_struct(var_type)) { sz = var_type->size; if (expected_size && expected_size != sz * nelems) return 0; ret = btf_find_nested_struct(btf, var_type, off, nelems, field_mask, &info[0], info_cnt, level); return ret; } if (field_type == 0) return 0; if (field_type < 0) return field_type; if (expected_size && expected_size != sz * nelems) return 0; if (off % align) return 0; switch (field_type) { case BPF_SPIN_LOCK: case BPF_RES_SPIN_LOCK: case BPF_TIMER: case BPF_WORKQUEUE: case BPF_LIST_NODE: case BPF_RB_NODE: case BPF_REFCOUNT: case BPF_TASK_WORK: ret = btf_find_struct(btf, var_type, off, sz, field_type, info_cnt ? &info[0] : &tmp); if (ret < 0) return ret; break; case BPF_KPTR_UNREF: case BPF_KPTR_REF: case BPF_KPTR_PERCPU: case BPF_UPTR: ret = btf_find_kptr(btf, var_type, off, sz, info_cnt ? &info[0] : &tmp, field_mask); if (ret < 0) return ret; break; case BPF_LIST_HEAD: case BPF_RB_ROOT: ret = btf_find_graph_root(btf, var, var_type, var_idx, off, sz, info_cnt ? &info[0] : &tmp, field_type); if (ret < 0) return ret; break; default: return -EFAULT; } if (ret == BTF_FIELD_IGNORE) return 0; if (!info_cnt) return -E2BIG; if (nelems > 1) { ret = btf_repeat_fields(info, info_cnt, 1, nelems - 1, sz); if (ret < 0) return ret; } return nelems; } static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t, u32 field_mask, struct btf_field_info *info, int info_cnt, u32 level) { int ret, idx = 0; const struct btf_member *member; u32 i, off, seen_mask = 0; for_each_member(i, t, member) { const struct btf_type *member_type = btf_type_by_id(btf, member->type); off = __btf_member_bit_offset(t, member); if (off % 8) /* valid C code cannot generate such BTF */ return -EINVAL; off /= 8; ret = btf_find_field_one(btf, t, member_type, i, off, 0, field_mask, &seen_mask, &info[idx], info_cnt - idx, level); if (ret < 0) return ret; idx += ret; } return idx; } static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t, u32 field_mask, struct btf_field_info *info, int info_cnt, u32 level) { int ret, idx = 0; const struct btf_var_secinfo *vsi; u32 i, off, seen_mask = 0; for_each_vsi(i, t, vsi) { const struct btf_type *var = btf_type_by_id(btf, vsi->type); const struct btf_type *var_type = btf_type_by_id(btf, var->type); off = vsi->offset; ret = btf_find_field_one(btf, var, var_type, -1, off, vsi->size, field_mask, &seen_mask, &info[idx], info_cnt - idx, level); if (ret < 0) return ret; idx += ret; } return idx; } static int btf_find_field(const struct btf *btf, const struct btf_type *t, u32 field_mask, struct btf_field_info *info, int info_cnt) { if (__btf_type_is_struct(t)) return btf_find_struct_field(btf, t, field_mask, info, info_cnt, 0); else if (btf_type_is_datasec(t)) return btf_find_datasec_var(btf, t, field_mask, info, info_cnt, 0); return -EINVAL; } /* Callers have to ensure the life cycle of btf if it is program BTF */ static int btf_parse_kptr(const struct btf *btf, struct btf_field *field, struct btf_field_info *info) { struct module *mod = NULL; const struct btf_type *t; /* If a matching btf type is found in kernel or module BTFs, kptr_ref * is that BTF, otherwise it's program BTF */ struct btf *kptr_btf; int ret; s32 id; /* Find type in map BTF, and use it to look up the matching type * in vmlinux or module BTFs, by name and kind. */ t = btf_type_by_id(btf, info->kptr.type_id); id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info), &kptr_btf); if (id == -ENOENT) { /* btf_parse_kptr should only be called w/ btf = program BTF */ WARN_ON_ONCE(btf_is_kernel(btf)); /* Type exists only in program BTF. Assume that it's a MEM_ALLOC * kptr allocated via bpf_obj_new */ field->kptr.dtor = NULL; id = info->kptr.type_id; kptr_btf = (struct btf *)btf; goto found_dtor; } if (id < 0) return id; /* Find and stash the function pointer for the destruction function that * needs to be eventually invoked from the map free path. */ if (info->type == BPF_KPTR_REF) { const struct btf_type *dtor_func; const char *dtor_func_name; unsigned long addr; s32 dtor_btf_id; /* This call also serves as a whitelist of allowed objects that * can be used as a referenced pointer and be stored in a map at * the same time. */ dtor_btf_id = btf_find_dtor_kfunc(kptr_btf, id); if (dtor_btf_id < 0) { ret = dtor_btf_id; goto end_btf; } dtor_func = btf_type_by_id(kptr_btf, dtor_btf_id); if (!dtor_func) { ret = -ENOENT; goto end_btf; } if (btf_is_module(kptr_btf)) { mod = btf_try_get_module(kptr_btf); if (!mod) { ret = -ENXIO; goto end_btf; } } /* We already verified dtor_func to be btf_type_is_func * in register_btf_id_dtor_kfuncs. */ dtor_func_name = __btf_name_by_offset(kptr_btf, dtor_func->name_off); addr = kallsyms_lookup_name(dtor_func_name); if (!addr) { ret = -EINVAL; goto end_mod; } field->kptr.dtor = (void *)addr; } found_dtor: field->kptr.btf_id = id; field->kptr.btf = kptr_btf; field->kptr.module = mod; return 0; end_mod: module_put(mod); end_btf: btf_put(kptr_btf); return ret; } static int btf_parse_graph_root(const struct btf *btf, struct btf_field *field, struct btf_field_info *info, const char *node_type_name, size_t node_type_align) { const struct btf_type *t, *n = NULL; const struct btf_member *member; u32 offset; int i; t = btf_type_by_id(btf, info->graph_root.value_btf_id); /* We've already checked that value_btf_id is a struct type. We * just need to figure out the offset of the list_node, and * verify its type. */ for_each_member(i, t, member) { if (strcmp(info->graph_root.node_name, __btf_name_by_offset(btf, member->name_off))) continue; /* Invalid BTF, two members with same name */ if (n) return -EINVAL; n = btf_type_by_id(btf, member->type); if (!__btf_type_is_struct(n)) return -EINVAL; if (strcmp(node_type_name, __btf_name_by_offset(btf, n->name_off))) return -EINVAL; offset = __btf_member_bit_offset(n, member); if (offset % 8) return -EINVAL; offset /= 8; if (offset % node_type_align) return -EINVAL; field->graph_root.btf = (struct btf *)btf; field->graph_root.value_btf_id = info->graph_root.value_btf_id; field->graph_root.node_offset = offset; } if (!n) return -ENOENT; return 0; } static int btf_parse_list_head(const struct btf *btf, struct btf_field *field, struct btf_field_info *info) { return btf_parse_graph_root(btf, field, info, "bpf_list_node", __alignof__(struct bpf_list_node)); } static int btf_parse_rb_root(const struct btf *btf, struct btf_field *field, struct btf_field_info *info) { return btf_parse_graph_root(btf, field, info, "bpf_rb_node", __alignof__(struct bpf_rb_node)); } static int btf_field_cmp(const void *_a, const void *_b, const void *priv) { const struct btf_field *a = (const struct btf_field *)_a; const struct btf_field *b = (const struct btf_field *)_b; if (a->offset < b->offset) return -1; else if (a->offset > b->offset) return 1; return 0; } struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t, u32 field_mask, u32 value_size) { struct btf_field_info info_arr[BTF_FIELDS_MAX]; u32 next_off = 0, field_type_size; struct btf_record *rec; int ret, i, cnt; ret = btf_find_field(btf, t, field_mask, info_arr, ARRAY_SIZE(info_arr)); if (ret < 0) return ERR_PTR(ret); if (!ret) return NULL; cnt = ret; /* This needs to be kzalloc to zero out padding and unused fields, see * comment in btf_record_equal. */ rec = kzalloc(struct_size(rec, fields, cnt), GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (!rec) return ERR_PTR(-ENOMEM); rec->spin_lock_off = -EINVAL; rec->res_spin_lock_off = -EINVAL; rec->timer_off = -EINVAL; rec->wq_off = -EINVAL; rec->refcount_off = -EINVAL; rec->task_work_off = -EINVAL; for (i = 0; i < cnt; i++) { field_type_size = btf_field_type_size(info_arr[i].type); if (info_arr[i].off + field_type_size > value_size) { WARN_ONCE(1, "verifier bug off %d size %d", info_arr[i].off, value_size); ret = -EFAULT; goto end; } if (info_arr[i].off < next_off) { ret = -EEXIST; goto end; } next_off = info_arr[i].off + field_type_size; rec->field_mask |= info_arr[i].type; rec->fields[i].offset = info_arr[i].off; rec->fields[i].type = info_arr[i].type; rec->fields[i].size = field_type_size; switch (info_arr[i].type) { case BPF_SPIN_LOCK: WARN_ON_ONCE(rec->spin_lock_off >= 0); /* Cache offset for faster lookup at runtime */ rec->spin_lock_off = rec->fields[i].offset; break; case BPF_RES_SPIN_LOCK: WARN_ON_ONCE(rec->spin_lock_off >= 0); /* Cache offset for faster lookup at runtime */ rec->res_spin_lock_off = rec->fields[i].offset; break; case BPF_TIMER: WARN_ON_ONCE(rec->timer_off >= 0); /* Cache offset for faster lookup at runtime */ rec->timer_off = rec->fields[i].offset; break; case BPF_WORKQUEUE: WARN_ON_ONCE(rec->wq_off >= 0); /* Cache offset for faster lookup at runtime */ rec->wq_off = rec->fields[i].offset; break; case BPF_TASK_WORK: WARN_ON_ONCE(rec->task_work_off >= 0); rec->task_work_off = rec->fields[i].offset; break; case BPF_REFCOUNT: WARN_ON_ONCE(rec->refcount_off >= 0); /* Cache offset for faster lookup at runtime */ rec->refcount_off = rec->fields[i].offset; break; case BPF_KPTR_UNREF: case BPF_KPTR_REF: case BPF_KPTR_PERCPU: case BPF_UPTR: ret = btf_parse_kptr(btf, &rec->fields[i], &info_arr[i]); if (ret < 0) goto end; break; case BPF_LIST_HEAD: ret = btf_parse_list_head(btf, &rec->fields[i], &info_arr[i]); if (ret < 0) goto end; break; case BPF_RB_ROOT: ret = btf_parse_rb_root(btf, &rec->fields[i], &info_arr[i]); if (ret < 0) goto end; break; case BPF_LIST_NODE: case BPF_RB_NODE: break; default: ret = -EFAULT; goto end; } rec->cnt++; } if (rec->spin_lock_off >= 0 && rec->res_spin_lock_off >= 0) { ret = -EINVAL; goto end; } /* bpf_{list_head, rb_node} require bpf_spin_lock */ if ((btf_record_has_field(rec, BPF_LIST_HEAD) || btf_record_has_field(rec, BPF_RB_ROOT)) && (rec->spin_lock_off < 0 && rec->res_spin_lock_off < 0)) { ret = -EINVAL; goto end; } if (rec->refcount_off < 0 && btf_record_has_field(rec, BPF_LIST_NODE) && btf_record_has_field(rec, BPF_RB_NODE)) { ret = -EINVAL; goto end; } sort_r(rec->fields, rec->cnt, sizeof(struct btf_field), btf_field_cmp, NULL, rec); return rec; end: btf_record_free(rec); return ERR_PTR(ret); } int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec) { int i; /* There are three types that signify ownership of some other type: * kptr_ref, bpf_list_head, bpf_rb_root. * kptr_ref only supports storing kernel types, which can't store * references to program allocated local types. * * Hence we only need to ensure that bpf_{list_head,rb_root} ownership * does not form cycles. */ if (IS_ERR_OR_NULL(rec) || !(rec->field_mask & (BPF_GRAPH_ROOT | BPF_UPTR))) return 0; for (i = 0; i < rec->cnt; i++) { struct btf_struct_meta *meta; const struct btf_type *t; u32 btf_id; if (rec->fields[i].type == BPF_UPTR) { /* The uptr only supports pinning one page and cannot * point to a kernel struct */ if (btf_is_kernel(rec->fields[i].kptr.btf)) return -EINVAL; t = btf_type_by_id(rec->fields[i].kptr.btf, rec->fields[i].kptr.btf_id); if (!t->size) return -EINVAL; if (t->size > PAGE_SIZE) return -E2BIG; continue; } if (!(rec->fields[i].type & BPF_GRAPH_ROOT)) continue; btf_id = rec->fields[i].graph_root.value_btf_id; meta = btf_find_struct_meta(btf, btf_id); if (!meta) return -EFAULT; rec->fields[i].graph_root.value_rec = meta->record; /* We need to set value_rec for all root types, but no need * to check ownership cycle for a type unless it's also a * node type. */ if (!(rec->field_mask & BPF_GRAPH_NODE)) continue; /* We need to ensure ownership acyclicity among all types. The * proper way to do it would be to topologically sort all BTF * IDs based on the ownership edges, since there can be multiple * bpf_{list_head,rb_node} in a type. Instead, we use the * following resaoning: * * - A type can only be owned by another type in user BTF if it * has a bpf_{list,rb}_node. Let's call these node types. * - A type can only _own_ another type in user BTF if it has a * bpf_{list_head,rb_root}. Let's call these root types. * * We ensure that if a type is both a root and node, its * element types cannot be root types. * * To ensure acyclicity: * * When A is an root type but not a node, its ownership * chain can be: * A -> B -> C * Where: * - A is an root, e.g. has bpf_rb_root. * - B is both a root and node, e.g. has bpf_rb_node and * bpf_list_head. * - C is only an root, e.g. has bpf_list_node * * When A is both a root and node, some other type already * owns it in the BTF domain, hence it can not own * another root type through any of the ownership edges. * A -> B * Where: * - A is both an root and node. * - B is only an node. */ if (meta->record->field_mask & BPF_GRAPH_ROOT) return -ELOOP; } return 0; } static void __btf_struct_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { const struct btf_member *member; void *safe_data; u32 i; safe_data = btf_show_start_struct_type(show, t, type_id, data); if (!safe_data) return; for_each_member(i, t, member) { const struct btf_type *member_type = btf_type_by_id(btf, member->type); const struct btf_kind_operations *ops; u32 member_offset, bitfield_size; u32 bytes_offset; u8 bits8_offset; btf_show_start_member(show, member); member_offset = __btf_member_bit_offset(t, member); bitfield_size = __btf_member_bitfield_size(t, member); bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); bits8_offset = BITS_PER_BYTE_MASKED(member_offset); if (bitfield_size) { safe_data = btf_show_start_type(show, member_type, member->type, data + bytes_offset); if (safe_data) btf_bitfield_show(safe_data, bits8_offset, bitfield_size, show); btf_show_end_type(show); } else { ops = btf_type_ops(member_type); ops->show(btf, member_type, member->type, data + bytes_offset, bits8_offset, show); } btf_show_end_member(show); } btf_show_end_struct_type(show); } static void btf_struct_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { const struct btf_member *m = show->state.member; /* * First check if any members would be shown (are non-zero). * See comments above "struct btf_show" definition for more * details on how this works at a high-level. */ if (show->state.depth > 0 && !(show->flags & BTF_SHOW_ZERO)) { if (!show->state.depth_check) { show->state.depth_check = show->state.depth + 1; show->state.depth_to_show = 0; } __btf_struct_show(btf, t, type_id, data, bits_offset, show); /* Restore saved member data here */ show->state.member = m; if (show->state.depth_check != show->state.depth + 1) return; show->state.depth_check = 0; if (show->state.depth_to_show <= show->state.depth) return; /* * Reaching here indicates we have recursed and found * non-zero child values. */ } __btf_struct_show(btf, t, type_id, data, bits_offset, show); } static const struct btf_kind_operations struct_ops = { .check_meta = btf_struct_check_meta, .resolve = btf_struct_resolve, .check_member = btf_struct_check_member, .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_struct_log, .show = btf_struct_show, }; static int btf_enum_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u32 struct_bits_off = member->offset; u32 struct_size, bytes_offset; if (BITS_PER_BYTE_MASKED(struct_bits_off)) { btf_verifier_log_member(env, struct_type, member, "Member is not byte aligned"); return -EINVAL; } struct_size = struct_type->size; bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); if (struct_size - bytes_offset < member_type->size) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static int btf_enum_check_kflag_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u32 struct_bits_off, nr_bits, bytes_end, struct_size; u32 int_bitsize = sizeof(int) * BITS_PER_BYTE; struct_bits_off = BTF_MEMBER_BIT_OFFSET(member->offset); nr_bits = BTF_MEMBER_BITFIELD_SIZE(member->offset); if (!nr_bits) { if (BITS_PER_BYTE_MASKED(struct_bits_off)) { btf_verifier_log_member(env, struct_type, member, "Member is not byte aligned"); return -EINVAL; } nr_bits = int_bitsize; } else if (nr_bits > int_bitsize) { btf_verifier_log_member(env, struct_type, member, "Invalid member bitfield_size"); return -EINVAL; } struct_size = struct_type->size; bytes_end = BITS_ROUNDUP_BYTES(struct_bits_off + nr_bits); if (struct_size < bytes_end) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static s32 btf_enum_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { const struct btf_enum *enums = btf_type_enum(t); struct btf *btf = env->btf; const char *fmt_str; u16 i, nr_enums; u32 meta_needed; nr_enums = btf_type_vlen(t); meta_needed = nr_enums * sizeof(*enums); if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } if (t->size > 8 || !is_power_of_2(t->size)) { btf_verifier_log_type(env, t, "Unexpected size"); return -EINVAL; } /* enum type either no name or a valid one */ if (t->name_off && !btf_name_valid_identifier(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); for (i = 0; i < nr_enums; i++) { if (!btf_name_offset_valid(btf, enums[i].name_off)) { btf_verifier_log(env, "\tInvalid name_offset:%u", enums[i].name_off); return -EINVAL; } /* enum member must have a valid name */ if (!enums[i].name_off || !btf_name_valid_identifier(btf, enums[i].name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } if (env->log.level == BPF_LOG_KERNEL) continue; fmt_str = btf_type_kflag(t) ? "\t%s val=%d\n" : "\t%s val=%u\n"; btf_verifier_log(env, fmt_str, __btf_name_by_offset(btf, enums[i].name_off), enums[i].val); } return meta_needed; } static void btf_enum_log(struct btf_verifier_env *env, const struct btf_type *t) { btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t)); } static void btf_enum_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { const struct btf_enum *enums = btf_type_enum(t); u32 i, nr_enums = btf_type_vlen(t); void *safe_data; int v; safe_data = btf_show_start_type(show, t, type_id, data); if (!safe_data) return; v = *(int *)safe_data; for (i = 0; i < nr_enums; i++) { if (v != enums[i].val) continue; btf_show_type_value(show, "%s", __btf_name_by_offset(btf, enums[i].name_off)); btf_show_end_type(show); return; } if (btf_type_kflag(t)) btf_show_type_value(show, "%d", v); else btf_show_type_value(show, "%u", v); btf_show_end_type(show); } static const struct btf_kind_operations enum_ops = { .check_meta = btf_enum_check_meta, .resolve = btf_df_resolve, .check_member = btf_enum_check_member, .check_kflag_member = btf_enum_check_kflag_member, .log_details = btf_enum_log, .show = btf_enum_show, }; static s32 btf_enum64_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { const struct btf_enum64 *enums = btf_type_enum64(t); struct btf *btf = env->btf; const char *fmt_str; u16 i, nr_enums; u32 meta_needed; nr_enums = btf_type_vlen(t); meta_needed = nr_enums * sizeof(*enums); if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } if (t->size > 8 || !is_power_of_2(t->size)) { btf_verifier_log_type(env, t, "Unexpected size"); return -EINVAL; } /* enum type either no name or a valid one */ if (t->name_off && !btf_name_valid_identifier(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); for (i = 0; i < nr_enums; i++) { if (!btf_name_offset_valid(btf, enums[i].name_off)) { btf_verifier_log(env, "\tInvalid name_offset:%u", enums[i].name_off); return -EINVAL; } /* enum member must have a valid name */ if (!enums[i].name_off || !btf_name_valid_identifier(btf, enums[i].name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } if (env->log.level == BPF_LOG_KERNEL) continue; fmt_str = btf_type_kflag(t) ? "\t%s val=%lld\n" : "\t%s val=%llu\n"; btf_verifier_log(env, fmt_str, __btf_name_by_offset(btf, enums[i].name_off), btf_enum64_value(enums + i)); } return meta_needed; } static void btf_enum64_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { const struct btf_enum64 *enums = btf_type_enum64(t); u32 i, nr_enums = btf_type_vlen(t); void *safe_data; s64 v; safe_data = btf_show_start_type(show, t, type_id, data); if (!safe_data) return; v = *(u64 *)safe_data; for (i = 0; i < nr_enums; i++) { if (v != btf_enum64_value(enums + i)) continue; btf_show_type_value(show, "%s", __btf_name_by_offset(btf, enums[i].name_off)); btf_show_end_type(show); return; } if (btf_type_kflag(t)) btf_show_type_value(show, "%lld", v); else btf_show_type_value(show, "%llu", v); btf_show_end_type(show); } static const struct btf_kind_operations enum64_ops = { .check_meta = btf_enum64_check_meta, .resolve = btf_df_resolve, .check_member = btf_enum_check_member, .check_kflag_member = btf_enum_check_kflag_member, .log_details = btf_enum_log, .show = btf_enum64_show, }; static s32 btf_func_proto_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { u32 meta_needed = btf_type_vlen(t) * sizeof(struct btf_param); if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } if (t->name_off) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } if (btf_type_kflag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); return meta_needed; } static void btf_func_proto_log(struct btf_verifier_env *env, const struct btf_type *t) { const struct btf_param *args = (const struct btf_param *)(t + 1); u16 nr_args = btf_type_vlen(t), i; btf_verifier_log(env, "return=%u args=(", t->type); if (!nr_args) { btf_verifier_log(env, "void"); goto done; } if (nr_args == 1 && !args[0].type) { /* Only one vararg */ btf_verifier_log(env, "vararg"); goto done; } btf_verifier_log(env, "%u %s", args[0].type, __btf_name_by_offset(env->btf, args[0].name_off)); for (i = 1; i < nr_args - 1; i++) btf_verifier_log(env, ", %u %s", args[i].type, __btf_name_by_offset(env->btf, args[i].name_off)); if (nr_args > 1) { const struct btf_param *last_arg = &args[nr_args - 1]; if (last_arg->type) btf_verifier_log(env, ", %u %s", last_arg->type, __btf_name_by_offset(env->btf, last_arg->name_off)); else btf_verifier_log(env, ", vararg"); } done: btf_verifier_log(env, ")"); } static const struct btf_kind_operations func_proto_ops = { .check_meta = btf_func_proto_check_meta, .resolve = btf_df_resolve, /* * BTF_KIND_FUNC_PROTO cannot be directly referred by * a struct's member. * * It should be a function pointer instead. * (i.e. struct's member -> BTF_KIND_PTR -> BTF_KIND_FUNC_PROTO) * * Hence, there is no btf_func_check_member(). */ .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_func_proto_log, .show = btf_df_show, }; static s32 btf_func_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { if (!t->name_off || !btf_name_valid_identifier(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } if (btf_type_vlen(t) > BTF_FUNC_GLOBAL) { btf_verifier_log_type(env, t, "Invalid func linkage"); return -EINVAL; } if (btf_type_kflag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); return 0; } static int btf_func_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_type *t = v->t; u32 next_type_id = t->type; int err; err = btf_func_check(env, t); if (err) return err; env_stack_pop_resolved(env, next_type_id, 0); return 0; } static const struct btf_kind_operations func_ops = { .check_meta = btf_func_check_meta, .resolve = btf_func_resolve, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_ref_type_log, .show = btf_df_show, }; static s32 btf_var_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { const struct btf_var *var; u32 meta_needed = sizeof(*var); if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; } if (btf_type_kflag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } if (!t->name_off || !btf_name_valid_identifier(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } /* A var cannot be in type void */ if (!t->type || !BTF_TYPE_ID_VALID(t->type)) { btf_verifier_log_type(env, t, "Invalid type_id"); return -EINVAL; } var = btf_type_var(t); if (var->linkage != BTF_VAR_STATIC && var->linkage != BTF_VAR_GLOBAL_ALLOCATED) { btf_verifier_log_type(env, t, "Linkage not supported"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); return meta_needed; } static void btf_var_log(struct btf_verifier_env *env, const struct btf_type *t) { const struct btf_var *var = btf_type_var(t); btf_verifier_log(env, "type_id=%u linkage=%u", t->type, var->linkage); } static const struct btf_kind_operations var_ops = { .check_meta = btf_var_check_meta, .resolve = btf_var_resolve, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_var_log, .show = btf_var_show, }; static s32 btf_datasec_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { const struct btf_var_secinfo *vsi; u64 last_vsi_end_off = 0, sum = 0; u32 i, meta_needed; meta_needed = btf_type_vlen(t) * sizeof(*vsi); if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } if (!t->size) { btf_verifier_log_type(env, t, "size == 0"); return -EINVAL; } if (btf_type_kflag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } if (!t->name_off || !btf_name_valid_section(env->btf, t->name_off)) { btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); for_each_vsi(i, t, vsi) { /* A var cannot be in type void */ if (!vsi->type || !BTF_TYPE_ID_VALID(vsi->type)) { btf_verifier_log_vsi(env, t, vsi, "Invalid type_id"); return -EINVAL; } if (vsi->offset < last_vsi_end_off || vsi->offset >= t->size) { btf_verifier_log_vsi(env, t, vsi, "Invalid offset"); return -EINVAL; } if (!vsi->size || vsi->size > t->size) { btf_verifier_log_vsi(env, t, vsi, "Invalid size"); return -EINVAL; } last_vsi_end_off = vsi->offset + vsi->size; if (last_vsi_end_off > t->size) { btf_verifier_log_vsi(env, t, vsi, "Invalid offset+size"); return -EINVAL; } btf_verifier_log_vsi(env, t, vsi, NULL); sum += vsi->size; } if (t->size < sum) { btf_verifier_log_type(env, t, "Invalid btf_info size"); return -EINVAL; } return meta_needed; } static int btf_datasec_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_var_secinfo *vsi; struct btf *btf = env->btf; u16 i; env->resolve_mode = RESOLVE_TBD; for_each_vsi_from(i, v->next_member, v->t, vsi) { u32 var_type_id = vsi->type, type_id, type_size = 0; const struct btf_type *var_type = btf_type_by_id(env->btf, var_type_id); if (!var_type || !btf_type_is_var(var_type)) { btf_verifier_log_vsi(env, v->t, vsi, "Not a VAR kind member"); return -EINVAL; } if (!env_type_is_resolve_sink(env, var_type) && !env_type_is_resolved(env, var_type_id)) { env_stack_set_next_member(env, i + 1); return env_stack_push(env, var_type, var_type_id); } type_id = var_type->type; if (!btf_type_id_size(btf, &type_id, &type_size)) { btf_verifier_log_vsi(env, v->t, vsi, "Invalid type"); return -EINVAL; } if (vsi->size < type_size) { btf_verifier_log_vsi(env, v->t, vsi, "Invalid size"); return -EINVAL; } } env_stack_pop_resolved(env, 0, 0); return 0; } static void btf_datasec_log(struct btf_verifier_env *env, const struct btf_type *t) { btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t)); } static void btf_datasec_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct btf_show *show) { const struct btf_var_secinfo *vsi; const struct btf_type *var; u32 i; if (!btf_show_start_type(show, t, type_id, data)) return; btf_show_type_value(show, "section (\"%s\") = {", __btf_name_by_offset(btf, t->name_off)); for_each_vsi(i, t, vsi) { var = btf_type_by_id(btf, vsi->type); if (i) btf_show(show, ","); btf_type_ops(var)->show(btf, var, vsi->type, data + vsi->offset, bits_offset, show); } btf_show_end_type(show); } static const struct btf_kind_operations datasec_ops = { .check_meta = btf_datasec_check_meta, .resolve = btf_datasec_resolve, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_datasec_log, .show = btf_datasec_show, }; static s32 btf_float_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; } if (btf_type_kflag(t)) { btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); return -EINVAL; } if (t->size != 2 && t->size != 4 && t->size != 8 && t->size != 12 && t->size != 16) { btf_verifier_log_type(env, t, "Invalid type_size"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); return 0; } static int btf_float_check_member(struct btf_verifier_env *env, const struct btf_type *struct_type, const struct btf_member *member, const struct btf_type *member_type) { u64 start_offset_bytes; u64 end_offset_bytes; u64 misalign_bits; u64 align_bytes; u64 align_bits; /* Different architectures have different alignment requirements, so * here we check only for the reasonable minimum. This way we ensure * that types after CO-RE can pass the kernel BTF verifier. */ align_bytes = min_t(u64, sizeof(void *), member_type->size); align_bits = align_bytes * BITS_PER_BYTE; div64_u64_rem(member->offset, align_bits, &misalign_bits); if (misalign_bits) { btf_verifier_log_member(env, struct_type, member, "Member is not properly aligned"); return -EINVAL; } start_offset_bytes = member->offset / BITS_PER_BYTE; end_offset_bytes = start_offset_bytes + member_type->size; if (end_offset_bytes > struct_type->size) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } return 0; } static void btf_float_log(struct btf_verifier_env *env, const struct btf_type *t) { btf_verifier_log(env, "size=%u", t->size); } static const struct btf_kind_operations float_ops = { .check_meta = btf_float_check_meta, .resolve = btf_df_resolve, .check_member = btf_float_check_member, .check_kflag_member = btf_generic_check_kflag_member, .log_details = btf_float_log, .show = btf_df_show, }; static s32 btf_decl_tag_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { const struct btf_decl_tag *tag; u32 meta_needed = sizeof(*tag); s32 component_idx; const char *value; if (meta_left < meta_needed) { btf_verifier_log_basic(env, t, "meta_left:%u meta_needed:%u", meta_left, meta_needed); return -EINVAL; } value = btf_name_by_offset(env->btf, t->name_off); if (!value || !value[0]) { btf_verifier_log_type(env, t, "Invalid value"); return -EINVAL; } if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; } component_idx = btf_type_decl_tag(t)->component_idx; if (component_idx < -1) { btf_verifier_log_type(env, t, "Invalid component_idx"); return -EINVAL; } btf_verifier_log_type(env, t, NULL); return meta_needed; } static int btf_decl_tag_resolve(struct btf_verifier_env *env, const struct resolve_vertex *v) { const struct btf_type *next_type; const struct btf_type *t = v->t; u32 next_type_id = t->type; struct btf *btf = env->btf; s32 component_idx; u32 vlen; next_type = btf_type_by_id(btf, next_type_id); if (!next_type || !btf_type_is_decl_tag_target(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } if (!env_type_is_resolve_sink(env, next_type) && !env_type_is_resolved(env, next_type_id)) return env_stack_push(env, next_type, next_type_id); component_idx = btf_type_decl_tag(t)->component_idx; if (component_idx != -1) { if (btf_type_is_var(next_type) || btf_type_is_typedef(next_type)) { btf_verifier_log_type(env, v->t, "Invalid component_idx"); return -EINVAL; } if (btf_type_is_struct(next_type)) { vlen = btf_type_vlen(next_type); } else { /* next_type should be a function */ next_type = btf_type_by_id(btf, next_type->type); vlen = btf_type_vlen(next_type); } if ((u32)component_idx >= vlen) { btf_verifier_log_type(env, v->t, "Invalid component_idx"); return -EINVAL; } } env_stack_pop_resolved(env, next_type_id, 0); return 0; } static void btf_decl_tag_log(struct btf_verifier_env *env, const struct btf_type *t) { btf_verifier_log(env, "type=%u component_idx=%d", t->type, btf_type_decl_tag(t)->component_idx); } static const struct btf_kind_operations decl_tag_ops = { .check_meta = btf_decl_tag_check_meta, .resolve = btf_decl_tag_resolve, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, .log_details = btf_decl_tag_log, .show = btf_df_show, }; static int btf_func_proto_check(struct btf_verifier_env *env, const struct btf_type *t) { const struct btf_type *ret_type; const struct btf_param *args; const struct btf *btf; u16 nr_args, i; int err; btf = env->btf; args = (const struct btf_param *)(t + 1); nr_args = btf_type_vlen(t); /* Check func return type which could be "void" (t->type == 0) */ if (t->type) { u32 ret_type_id = t->type; ret_type = btf_type_by_id(btf, ret_type_id); if (!ret_type) { btf_verifier_log_type(env, t, "Invalid return type"); return -EINVAL; } if (btf_type_is_resolve_source_only(ret_type)) { btf_verifier_log_type(env, t, "Invalid return type"); return -EINVAL; } if (btf_type_needs_resolve(ret_type) && !env_type_is_resolved(env, ret_type_id)) { err = btf_resolve(env, ret_type, ret_type_id); if (err) return err; } /* Ensure the return type is a type that has a size */ if (!btf_type_id_size(btf, &ret_type_id, NULL)) { btf_verifier_log_type(env, t, "Invalid return type"); return -EINVAL; } } if (!nr_args) return 0; /* Last func arg type_id could be 0 if it is a vararg */ if (!args[nr_args - 1].type) { if (args[nr_args - 1].name_off) { btf_verifier_log_type(env, t, "Invalid arg#%u", nr_args); return -EINVAL; } nr_args--; } for (i = 0; i < nr_args; i++) { const struct btf_type *arg_type; u32 arg_type_id; arg_type_id = args[i].type; arg_type = btf_type_by_id(btf, arg_type_id); if (!arg_type) { btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); return -EINVAL; } if (btf_type_is_resolve_source_only(arg_type)) { btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); return -EINVAL; } if (args[i].name_off && (!btf_name_offset_valid(btf, args[i].name_off) || !btf_name_valid_identifier(btf, args[i].name_off))) { btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); return -EINVAL; } if (btf_type_needs_resolve(arg_type) && !env_type_is_resolved(env, arg_type_id)) { err = btf_resolve(env, arg_type, arg_type_id); if (err) return err; } if (!btf_type_id_size(btf, &arg_type_id, NULL)) { btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); return -EINVAL; } } return 0; } static int btf_func_check(struct btf_verifier_env *env, const struct btf_type *t) { const struct btf_type *proto_type; const struct btf_param *args; const struct btf *btf; u16 nr_args, i; btf = env->btf; proto_type = btf_type_by_id(btf, t->type); if (!proto_type || !btf_type_is_func_proto(proto_type)) { btf_verifier_log_type(env, t, "Invalid type_id"); return -EINVAL; } args = (const struct btf_param *)(proto_type + 1); nr_args = btf_type_vlen(proto_type); for (i = 0; i < nr_args; i++) { if (!args[i].name_off && args[i].type) { btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); return -EINVAL; } } return 0; } static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS] = { [BTF_KIND_INT] = &int_ops, [BTF_KIND_PTR] = &ptr_ops, [BTF_KIND_ARRAY] = &array_ops, [BTF_KIND_STRUCT] = &struct_ops, [BTF_KIND_UNION] = &struct_ops, [BTF_KIND_ENUM] = &enum_ops, [BTF_KIND_FWD] = &fwd_ops, [BTF_KIND_TYPEDEF] = &modifier_ops, [BTF_KIND_VOLATILE] = &modifier_ops, [BTF_KIND_CONST] = &modifier_ops, [BTF_KIND_RESTRICT] = &modifier_ops, [BTF_KIND_FUNC] = &func_ops, [BTF_KIND_FUNC_PROTO] = &func_proto_ops, [BTF_KIND_VAR] = &var_ops, [BTF_KIND_DATASEC] = &datasec_ops, [BTF_KIND_FLOAT] = &float_ops, [BTF_KIND_DECL_TAG] = &decl_tag_ops, [BTF_KIND_TYPE_TAG] = &modifier_ops, [BTF_KIND_ENUM64] = &enum64_ops, }; static s32 btf_check_meta(struct btf_verifier_env *env, const struct btf_type *t, u32 meta_left) { u32 saved_meta_left = meta_left; s32 var_meta_size; if (meta_left < sizeof(*t)) { btf_verifier_log(env, "[%u] meta_left:%u meta_needed:%zu", env->log_type_id, meta_left, sizeof(*t)); return -EINVAL; } meta_left -= sizeof(*t); if (t->info & ~BTF_INFO_MASK) { btf_verifier_log(env, "[%u] Invalid btf_info:%x", env->log_type_id, t->info); return -EINVAL; } if (BTF_INFO_KIND(t->info) > BTF_KIND_MAX || BTF_INFO_KIND(t->info) == BTF_KIND_UNKN) { btf_verifier_log(env, "[%u] Invalid kind:%u", env->log_type_id, BTF_INFO_KIND(t->info)); return -EINVAL; } if (!btf_name_offset_valid(env->btf, t->name_off)) { btf_verifier_log(env, "[%u] Invalid name_offset:%u", env->log_type_id, t->name_off); return -EINVAL; } var_meta_size = btf_type_ops(t)->check_meta(env, t, meta_left); if (var_meta_size < 0) return var_meta_size; meta_left -= var_meta_size; return saved_meta_left - meta_left; } static int btf_check_all_metas(struct btf_verifier_env *env) { struct btf *btf = env->btf; struct btf_header *hdr; void *cur, *end; hdr = &btf->hdr; cur = btf->nohdr_data + hdr->type_off; end = cur + hdr->type_len; env->log_type_id = btf->base_btf ? btf->start_id : 1; while (cur < end) { struct btf_type *t = cur; s32 meta_size; meta_size = btf_check_meta(env, t, end - cur); if (meta_size < 0) return meta_size; btf_add_type(env, t); cur += meta_size; env->log_type_id++; } return 0; } static bool btf_resolve_valid(struct btf_verifier_env *env, const struct btf_type *t, u32 type_id) { struct btf *btf = env->btf; if (!env_type_is_resolved(env, type_id)) return false; if (btf_type_is_struct(t) || btf_type_is_datasec(t)) return !btf_resolved_type_id(btf, type_id) && !btf_resolved_type_size(btf, type_id); if (btf_type_is_decl_tag(t) || btf_type_is_func(t)) return btf_resolved_type_id(btf, type_id) && !btf_resolved_type_size(btf, type_id); if (btf_type_is_modifier(t) || btf_type_is_ptr(t) || btf_type_is_var(t)) { t = btf_type_id_resolve(btf, &type_id); return t && !btf_type_is_modifier(t) && !btf_type_is_var(t) && !btf_type_is_datasec(t); } if (btf_type_is_array(t)) { const struct btf_array *array = btf_type_array(t); const struct btf_type *elem_type; u32 elem_type_id = array->type; u32 elem_size; elem_type = btf_type_id_size(btf, &elem_type_id, &elem_size); return elem_type && !btf_type_is_modifier(elem_type) && (array->nelems * elem_size == btf_resolved_type_size(btf, type_id)); } return false; } static int btf_resolve(struct btf_verifier_env *env, const struct btf_type *t, u32 type_id) { u32 save_log_type_id = env->log_type_id; const struct resolve_vertex *v; int err = 0; env->resolve_mode = RESOLVE_TBD; env_stack_push(env, t, type_id); while (!err && (v = env_stack_peak(env))) { env->log_type_id = v->type_id; err = btf_type_ops(v->t)->resolve(env, v); } env->log_type_id = type_id; if (err == -E2BIG) { btf_verifier_log_type(env, t, "Exceeded max resolving depth:%u", MAX_RESOLVE_DEPTH); } else if (err == -EEXIST) { btf_verifier_log_type(env, t, "Loop detected"); } /* Final sanity check */ if (!err && !btf_resolve_valid(env, t, type_id)) { btf_verifier_log_type(env, t, "Invalid resolve state"); err = -EINVAL; } env->log_type_id = save_log_type_id; return err; } static int btf_check_all_types(struct btf_verifier_env *env) { struct btf *btf = env->btf; const struct btf_type *t; u32 type_id, i; int err; err = env_resolve_init(env); if (err) return err; env->phase++; for (i = btf->base_btf ? 0 : 1; i < btf->nr_types; i++) { type_id = btf->start_id + i; t = btf_type_by_id(btf, type_id); env->log_type_id = type_id; if (btf_type_needs_resolve(t) && !env_type_is_resolved(env, type_id)) { err = btf_resolve(env, t, type_id); if (err) return err; } if (btf_type_is_func_proto(t)) { err = btf_func_proto_check(env, t); if (err) return err; } } return 0; } static int btf_parse_type_sec(struct btf_verifier_env *env) { const struct btf_header *hdr = &env->btf->hdr; int err; /* Type section must align to 4 bytes */ if (hdr->type_off & (sizeof(u32) - 1)) { btf_verifier_log(env, "Unaligned type_off"); return -EINVAL; } if (!env->btf->base_btf && !hdr->type_len) { btf_verifier_log(env, "No type found"); return -EINVAL; } err = btf_check_all_metas(env); if (err) return err; return btf_check_all_types(env); } static int btf_parse_str_sec(struct btf_verifier_env *env) { const struct btf_header *hdr; struct btf *btf = env->btf; const char *start, *end; hdr = &btf->hdr; start = btf->nohdr_data + hdr->str_off; end = start + hdr->str_len; if (end != btf->data + btf->data_size) { btf_verifier_log(env, "String section is not at the end"); return -EINVAL; } btf->strings = start; if (btf->base_btf && !hdr->str_len) return 0; if (!hdr->str_len || hdr->str_len - 1 > BTF_MAX_NAME_OFFSET || end[-1]) { btf_verifier_log(env, "Invalid string section"); return -EINVAL; } if (!btf->base_btf && start[0]) { btf_verifier_log(env, "Invalid string section"); return -EINVAL; } return 0; } static const size_t btf_sec_info_offset[] = { offsetof(struct btf_header, type_off), offsetof(struct btf_header, str_off), }; static int btf_sec_info_cmp(const void *a, const void *b) { const struct btf_sec_info *x = a; const struct btf_sec_info *y = b; return (int)(x->off - y->off) ? : (int)(x->len - y->len); } static int btf_check_sec_info(struct btf_verifier_env *env, u32 btf_data_size) { struct btf_sec_info secs[ARRAY_SIZE(btf_sec_info_offset)]; u32 total, expected_total, i; const struct btf_header *hdr; const struct btf *btf; btf = env->btf; hdr = &btf->hdr; /* Populate the secs from hdr */ for (i = 0; i < ARRAY_SIZE(btf_sec_info_offset); i++) secs[i] = *(struct btf_sec_info *)((void *)hdr + btf_sec_info_offset[i]); sort(secs, ARRAY_SIZE(btf_sec_info_offset), sizeof(struct btf_sec_info), btf_sec_info_cmp, NULL); /* Check for gaps and overlap among sections */ total = 0; expected_total = btf_data_size - hdr->hdr_len; for (i = 0; i < ARRAY_SIZE(btf_sec_info_offset); i++) { if (expected_total < secs[i].off) { btf_verifier_log(env, "Invalid section offset"); return -EINVAL; } if (total < secs[i].off) { /* gap */ btf_verifier_log(env, "Unsupported section found"); return -EINVAL; } if (total > secs[i].off) { btf_verifier_log(env, "Section overlap found"); return -EINVAL; } if (expected_total - total < secs[i].len) { btf_verifier_log(env, "Total section length too long"); return -EINVAL; } total += secs[i].len; } /* There is data other than hdr and known sections */ if (expected_total != total) { btf_verifier_log(env, "Unsupported section found"); return -EINVAL; } return 0; } static int btf_parse_hdr(struct btf_verifier_env *env) { u32 hdr_len, hdr_copy, btf_data_size; const struct btf_header *hdr; struct btf *btf; btf = env->btf; btf_data_size = btf->data_size; if (btf_data_size < offsetofend(struct btf_header, hdr_len)) { btf_verifier_log(env, "hdr_len not found"); return -EINVAL; } hdr = btf->data; hdr_len = hdr->hdr_len; if (btf_data_size < hdr_len) { btf_verifier_log(env, "btf_header not found"); return -EINVAL; } /* Ensure the unsupported header fields are zero */ if (hdr_len > sizeof(btf->hdr)) { u8 *expected_zero = btf->data + sizeof(btf->hdr); u8 *end = btf->data + hdr_len; for (; expected_zero < end; expected_zero++) { if (*expected_zero) { btf_verifier_log(env, "Unsupported btf_header"); return -E2BIG; } } } hdr_copy = min_t(u32, hdr_len, sizeof(btf->hdr)); memcpy(&btf->hdr, btf->data, hdr_copy); hdr = &btf->hdr; btf_verifier_log_hdr(env, btf_data_size); if (hdr->magic != BTF_MAGIC) { btf_verifier_log(env, "Invalid magic"); return -EINVAL; } if (hdr->version != BTF_VERSION) { btf_verifier_log(env, "Unsupported version"); return -ENOTSUPP; } if (hdr->flags) { btf_verifier_log(env, "Unsupported flags"); return -ENOTSUPP; } if (!btf->base_btf && btf_data_size == hdr->hdr_len) { btf_verifier_log(env, "No data"); return -EINVAL; } return btf_check_sec_info(env, btf_data_size); } static const char *alloc_obj_fields[] = { "bpf_spin_lock", "bpf_list_head", "bpf_list_node", "bpf_rb_root", "bpf_rb_node", "bpf_refcount", }; static struct btf_struct_metas * btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf) { struct btf_struct_metas *tab = NULL; struct btf_id_set *aof; int i, n, id, ret; BUILD_BUG_ON(offsetof(struct btf_id_set, cnt) != 0); BUILD_BUG_ON(sizeof(struct btf_id_set) != sizeof(u32)); aof = kmalloc(sizeof(*aof), GFP_KERNEL | __GFP_NOWARN); if (!aof) return ERR_PTR(-ENOMEM); aof->cnt = 0; for (i = 0; i < ARRAY_SIZE(alloc_obj_fields); i++) { /* Try to find whether this special type exists in user BTF, and * if so remember its ID so we can easily find it among members * of structs that we iterate in the next loop. */ struct btf_id_set *new_aof; id = btf_find_by_name_kind(btf, alloc_obj_fields[i], BTF_KIND_STRUCT); if (id < 0) continue; new_aof = krealloc(aof, struct_size(new_aof, ids, aof->cnt + 1), GFP_KERNEL | __GFP_NOWARN); if (!new_aof) { ret = -ENOMEM; goto free_aof; } aof = new_aof; aof->ids[aof->cnt++] = id; } n = btf_nr_types(btf); for (i = 1; i < n; i++) { /* Try to find if there are kptrs in user BTF and remember their ID */ struct btf_id_set *new_aof; struct btf_field_info tmp; const struct btf_type *t; t = btf_type_by_id(btf, i); if (!t) { ret = -EINVAL; goto free_aof; } ret = btf_find_kptr(btf, t, 0, 0, &tmp, BPF_KPTR); if (ret != BTF_FIELD_FOUND) continue; new_aof = krealloc(aof, struct_size(new_aof, ids, aof->cnt + 1), GFP_KERNEL | __GFP_NOWARN); if (!new_aof) { ret = -ENOMEM; goto free_aof; } aof = new_aof; aof->ids[aof->cnt++] = i; } if (!aof->cnt) { kfree(aof); return NULL; } sort(&aof->ids, aof->cnt, sizeof(aof->ids[0]), btf_id_cmp_func, NULL); for (i = 1; i < n; i++) { struct btf_struct_metas *new_tab; const struct btf_member *member; struct btf_struct_meta *type; struct btf_record *record; const struct btf_type *t; int j, tab_cnt; t = btf_type_by_id(btf, i); if (!__btf_type_is_struct(t)) continue; cond_resched(); for_each_member(j, t, member) { if (btf_id_set_contains(aof, member->type)) goto parse; } continue; parse: tab_cnt = tab ? tab->cnt : 0; new_tab = krealloc(tab, struct_size(new_tab, types, tab_cnt + 1), GFP_KERNEL | __GFP_NOWARN); if (!new_tab) { ret = -ENOMEM; goto free; } if (!tab) new_tab->cnt = 0; tab = new_tab; type = &tab->types[tab->cnt]; type->btf_id = i; record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE | BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT | BPF_KPTR, t->size); /* The record cannot be unset, treat it as an error if so */ if (IS_ERR_OR_NULL(record)) { ret = PTR_ERR_OR_ZERO(record) ?: -EFAULT; goto free; } type->record = record; tab->cnt++; } kfree(aof); return tab; free: btf_struct_metas_free(tab); free_aof: kfree(aof); return ERR_PTR(ret); } struct btf_struct_meta *btf_find_struct_meta(const struct btf *btf, u32 btf_id) { struct btf_struct_metas *tab; BUILD_BUG_ON(offsetof(struct btf_struct_meta, btf_id) != 0); tab = btf->struct_meta_tab; if (!tab) return NULL; return bsearch(&btf_id, tab->types, tab->cnt, sizeof(tab->types[0]), btf_id_cmp_func); } static int btf_check_type_tags(struct btf_verifier_env *env, struct btf *btf, int start_id) { int i, n, good_id = start_id - 1; bool in_tags; n = btf_nr_types(btf); for (i = start_id; i < n; i++) { const struct btf_type *t; int chain_limit = 32; u32 cur_id = i; t = btf_type_by_id(btf, i); if (!t) return -EINVAL; if (!btf_type_is_modifier(t)) continue; cond_resched(); in_tags = btf_type_is_type_tag(t); while (btf_type_is_modifier(t)) { if (!chain_limit--) { btf_verifier_log(env, "Max chain length or cycle detected"); return -ELOOP; } if (btf_type_is_type_tag(t)) { if (!in_tags) { btf_verifier_log(env, "Type tags don't precede modifiers"); return -EINVAL; } } else if (in_tags) { in_tags = false; } if (cur_id <= good_id) break; /* Move to next type */ cur_id = t->type; t = btf_type_by_id(btf, cur_id); if (!t) return -EINVAL; } good_id = i; } return 0; } static int finalize_log(struct bpf_verifier_log *log, bpfptr_t uattr, u32 uattr_size) { u32 log_true_size; int err; err = bpf_vlog_finalize(log, &log_true_size); if (uattr_size >= offsetofend(union bpf_attr, btf_log_true_size) && copy_to_bpfptr_offset(uattr, offsetof(union bpf_attr, btf_log_true_size), &log_true_size, sizeof(log_true_size))) err = -EFAULT; return err; } static struct btf *btf_parse(const union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size) { bpfptr_t btf_data = make_bpfptr(attr->btf, uattr.is_kernel); char __user *log_ubuf = u64_to_user_ptr(attr->btf_log_buf); struct btf_struct_metas *struct_meta_tab; struct btf_verifier_env *env = NULL; struct btf *btf = NULL; u8 *data; int err, ret; if (attr->btf_size > BTF_MAX_SIZE) return ERR_PTR(-E2BIG); env = kzalloc(sizeof(*env), GFP_KERNEL | __GFP_NOWARN); if (!env) return ERR_PTR(-ENOMEM); /* user could have requested verbose verifier output * and supplied buffer to store the verification trace */ err = bpf_vlog_init(&env->log, attr->btf_log_level, log_ubuf, attr->btf_log_size); if (err) goto errout_free; btf = kzalloc(sizeof(*btf), GFP_KERNEL | __GFP_NOWARN); if (!btf) { err = -ENOMEM; goto errout; } env->btf = btf; data = kvmalloc(attr->btf_size, GFP_KERNEL | __GFP_NOWARN); if (!data) { err = -ENOMEM; goto errout; } btf->data = data; btf->data_size = attr->btf_size; if (copy_from_bpfptr(data, btf_data, attr->btf_size)) { err = -EFAULT; goto errout; } err = btf_parse_hdr(env); if (err) goto errout; btf->nohdr_data = btf->data + btf->hdr.hdr_len; err = btf_parse_str_sec(env); if (err) goto errout; err = btf_parse_type_sec(env); if (err) goto errout; err = btf_check_type_tags(env, btf, 1); if (err) goto errout; struct_meta_tab = btf_parse_struct_metas(&env->log, btf); if (IS_ERR(struct_meta_tab)) { err = PTR_ERR(struct_meta_tab); goto errout; } btf->struct_meta_tab = struct_meta_tab; if (struct_meta_tab) { int i; for (i = 0; i < struct_meta_tab->cnt; i++) { err = btf_check_and_fixup_fields(btf, struct_meta_tab->types[i].record); if (err < 0) goto errout_meta; } } err = finalize_log(&env->log, uattr, uattr_size); if (err) goto errout_free; btf_verifier_env_free(env); refcount_set(&btf->refcnt, 1); return btf; errout_meta: btf_free_struct_meta_tab(btf); errout: /* overwrite err with -ENOSPC or -EFAULT */ ret = finalize_log(&env->log, uattr, uattr_size); if (ret) err = ret; errout_free: btf_verifier_env_free(env); if (btf) btf_free(btf); return ERR_PTR(err); } extern char __start_BTF[]; extern char __stop_BTF[]; extern struct btf *btf_vmlinux; #define BPF_MAP_TYPE(_id, _ops) #define BPF_LINK_TYPE(_id, _name) static union { struct bpf_ctx_convert { #define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \ prog_ctx_type _id##_prog; \ kern_ctx_type _id##_kern; #include <linux/bpf_types.h> #undef BPF_PROG_TYPE } *__t; /* 't' is written once under lock. Read many times. */ const struct btf_type *t; } bpf_ctx_convert; enum { #define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \ __ctx_convert##_id, #include <linux/bpf_types.h> #undef BPF_PROG_TYPE __ctx_convert_unused, /* to avoid empty enum in extreme .config */ }; static u8 bpf_ctx_convert_map[] = { #define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \ [_id] = __ctx_convert##_id, #include <linux/bpf_types.h> #undef BPF_PROG_TYPE 0, /* avoid empty array */ }; #undef BPF_MAP_TYPE #undef BPF_LINK_TYPE static const struct btf_type *find_canonical_prog_ctx_type(enum bpf_prog_type prog_type) { const struct btf_type *conv_struct; const struct btf_member *ctx_type; conv_struct = bpf_ctx_convert.t; if (!conv_struct) return NULL; /* prog_type is valid bpf program type. No need for bounds check. */ ctx_type = btf_type_member(conv_struct) + bpf_ctx_convert_map[prog_type] * 2; /* ctx_type is a pointer to prog_ctx_type in vmlinux. * Like 'struct __sk_buff' */ return btf_type_by_id(btf_vmlinux, ctx_type->type); } static int find_kern_ctx_type_id(enum bpf_prog_type prog_type) { const struct btf_type *conv_struct; const struct btf_member *ctx_type; conv_struct = bpf_ctx_convert.t; if (!conv_struct) return -EFAULT; /* prog_type is valid bpf program type. No need for bounds check. */ ctx_type = btf_type_member(conv_struct) + bpf_ctx_convert_map[prog_type] * 2 + 1; /* ctx_type is a pointer to prog_ctx_type in vmlinux. * Like 'struct sk_buff' */ return ctx_type->type; } bool btf_is_projection_of(const char *pname, const char *tname) { if (strcmp(pname, "__sk_buff") == 0 && strcmp(tname, "sk_buff") == 0) return true; if (strcmp(pname, "xdp_md") == 0 && strcmp(tname, "xdp_buff") == 0) return true; return false; } bool btf_is_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf, const struct btf_type *t, enum bpf_prog_type prog_type, int arg) { const struct btf_type *ctx_type; const char *tname, *ctx_tname; t = btf_type_by_id(btf, t->type); /* KPROBE programs allow bpf_user_pt_regs_t typedef, which we need to * check before we skip all the typedef below. */ if (prog_type == BPF_PROG_TYPE_KPROBE) { while (btf_type_is_modifier(t) && !btf_type_is_typedef(t)) t = btf_type_by_id(btf, t->type); if (btf_type_is_typedef(t)) { tname = btf_name_by_offset(btf, t->name_off); if (tname && strcmp(tname, "bpf_user_pt_regs_t") == 0) return true; } } while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); if (!btf_type_is_struct(t)) { /* Only pointer to struct is supported for now. * That means that BPF_PROG_TYPE_TRACEPOINT with BTF * is not supported yet. * BPF_PROG_TYPE_RAW_TRACEPOINT is fine. */ return false; } tname = btf_name_by_offset(btf, t->name_off); if (!tname) { bpf_log(log, "arg#%d struct doesn't have a name\n", arg); return false; } ctx_type = find_canonical_prog_ctx_type(prog_type); if (!ctx_type) { bpf_log(log, "btf_vmlinux is malformed\n"); /* should not happen */ return false; } again: ctx_tname = btf_name_by_offset(btf_vmlinux, ctx_type->name_off); if (!ctx_tname) { /* should not happen */ bpf_log(log, "Please fix kernel include/linux/bpf_types.h\n"); return false; } /* program types without named context types work only with arg:ctx tag */ if (ctx_tname[0] == '\0') return false; /* only compare that prog's ctx type name is the same as * kernel expects. No need to compare field by field. * It's ok for bpf prog to do: * struct __sk_buff {}; * int socket_filter_bpf_prog(struct __sk_buff *skb) * { // no fields of skb are ever used } */ if (btf_is_projection_of(ctx_tname, tname)) return true; if (strcmp(ctx_tname, tname)) { /* bpf_user_pt_regs_t is a typedef, so resolve it to * underlying struct and check name again */ if (!btf_type_is_modifier(ctx_type)) return false; while (btf_type_is_modifier(ctx_type)) ctx_type = btf_type_by_id(btf_vmlinux, ctx_type->type); goto again; } return true; } /* forward declarations for arch-specific underlying types of * bpf_user_pt_regs_t; this avoids the need for arch-specific #ifdef * compilation guards below for BPF_PROG_TYPE_PERF_EVENT checks, but still * works correctly with __builtin_types_compatible_p() on respective * architectures */ struct user_regs_struct; struct user_pt_regs; static int btf_validate_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf, const struct btf_type *t, int arg, enum bpf_prog_type prog_type, enum bpf_attach_type attach_type) { const struct btf_type *ctx_type; const char *tname, *ctx_tname; if (!btf_is_ptr(t)) { bpf_log(log, "arg#%d type isn't a pointer\n", arg); return -EINVAL; } t = btf_type_by_id(btf, t->type); /* KPROBE and PERF_EVENT programs allow bpf_user_pt_regs_t typedef */ if (prog_type == BPF_PROG_TYPE_KPROBE || prog_type == BPF_PROG_TYPE_PERF_EVENT) { while (btf_type_is_modifier(t) && !btf_type_is_typedef(t)) t = btf_type_by_id(btf, t->type); if (btf_type_is_typedef(t)) { tname = btf_name_by_offset(btf, t->name_off); if (tname && strcmp(tname, "bpf_user_pt_regs_t") == 0) return 0; } } /* all other program types don't use typedefs for context type */ while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); /* `void *ctx __arg_ctx` is always valid */ if (btf_type_is_void(t)) return 0; tname = btf_name_by_offset(btf, t->name_off); if (str_is_empty(tname)) { bpf_log(log, "arg#%d type doesn't have a name\n", arg); return -EINVAL; } /* special cases */ switch (prog_type) { case BPF_PROG_TYPE_KPROBE: if (__btf_type_is_struct(t) && strcmp(tname, "pt_regs") == 0) return 0; break; case BPF_PROG_TYPE_PERF_EVENT: if (__builtin_types_compatible_p(bpf_user_pt_regs_t, struct pt_regs) && __btf_type_is_struct(t) && strcmp(tname, "pt_regs") == 0) return 0; if (__builtin_types_compatible_p(bpf_user_pt_regs_t, struct user_pt_regs) && __btf_type_is_struct(t) && strcmp(tname, "user_pt_regs") == 0) return 0; if (__builtin_types_compatible_p(bpf_user_pt_regs_t, struct user_regs_struct) && __btf_type_is_struct(t) && strcmp(tname, "user_regs_struct") == 0) return 0; break; case BPF_PROG_TYPE_RAW_TRACEPOINT: case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE: /* allow u64* as ctx */ if (btf_is_int(t) && t->size == 8) return 0; break; case BPF_PROG_TYPE_TRACING: switch (attach_type) { case BPF_TRACE_RAW_TP: /* tp_btf program is TRACING, so need special case here */ if (__btf_type_is_struct(t) && strcmp(tname, "bpf_raw_tracepoint_args") == 0) return 0; /* allow u64* as ctx */ if (btf_is_int(t) && t->size == 8) return 0; break; case BPF_TRACE_ITER: /* allow struct bpf_iter__xxx types only */ if (__btf_type_is_struct(t) && strncmp(tname, "bpf_iter__", sizeof("bpf_iter__") - 1) == 0) return 0; break; case BPF_TRACE_FENTRY: case BPF_TRACE_FEXIT: case BPF_MODIFY_RETURN: /* allow u64* as ctx */ if (btf_is_int(t) && t->size == 8) return 0; break; default: break; } break; case BPF_PROG_TYPE_LSM: case BPF_PROG_TYPE_STRUCT_OPS: /* allow u64* as ctx */ if (btf_is_int(t) && t->size == 8) return 0; break; case BPF_PROG_TYPE_TRACEPOINT: case BPF_PROG_TYPE_SYSCALL: case BPF_PROG_TYPE_EXT: return 0; /* anything goes */ default: break; } ctx_type = find_canonical_prog_ctx_type(prog_type); if (!ctx_type) { /* should not happen */ bpf_log(log, "btf_vmlinux is malformed\n"); return -EINVAL; } /* resolve typedefs and check that underlying structs are matching as well */ while (btf_type_is_modifier(ctx_type)) ctx_type = btf_type_by_id(btf_vmlinux, ctx_type->type); /* if program type doesn't have distinctly named struct type for * context, then __arg_ctx argument can only be `void *`, which we * already checked above */ if (!__btf_type_is_struct(ctx_type)) { bpf_log(log, "arg#%d should be void pointer\n", arg); return -EINVAL; } ctx_tname = btf_name_by_offset(btf_vmlinux, ctx_type->name_off); if (!__btf_type_is_struct(t) || strcmp(ctx_tname, tname) != 0) { bpf_log(log, "arg#%d should be `struct %s *`\n", arg, ctx_tname); return -EINVAL; } return 0; } static int btf_translate_to_vmlinux(struct bpf_verifier_log *log, struct btf *btf, const struct btf_type *t, enum bpf_prog_type prog_type, int arg) { if (!btf_is_prog_ctx_type(log, btf, t, prog_type, arg)) return -ENOENT; return find_kern_ctx_type_id(prog_type); } int get_kern_ctx_btf_id(struct bpf_verifier_log *log, enum bpf_prog_type prog_type) { const struct btf_member *kctx_member; const struct btf_type *conv_struct; const struct btf_type *kctx_type; u32 kctx_type_id; conv_struct = bpf_ctx_convert.t; /* get member for kernel ctx type */ kctx_member = btf_type_member(conv_struct) + bpf_ctx_convert_map[prog_type] * 2 + 1; kctx_type_id = kctx_member->type; kctx_type = btf_type_by_id(btf_vmlinux, kctx_type_id); if (!btf_type_is_struct(kctx_type)) { bpf_log(log, "kern ctx type id %u is not a struct\n", kctx_type_id); return -EINVAL; } return kctx_type_id; } BTF_ID_LIST_SINGLE(bpf_ctx_convert_btf_id, struct, bpf_ctx_convert) static struct btf *btf_parse_base(struct btf_verifier_env *env, const char *name, void *data, unsigned int data_size) { struct btf *btf = NULL; int err; if (!IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) return ERR_PTR(-ENOENT); btf = kzalloc(sizeof(*btf), GFP_KERNEL | __GFP_NOWARN); if (!btf) { err = -ENOMEM; goto errout; } env->btf = btf; btf->data = data; btf->data_size = data_size; btf->kernel_btf = true; snprintf(btf->name, sizeof(btf->name), "%s", name); err = btf_parse_hdr(env); if (err) goto errout; btf->nohdr_data = btf->data + btf->hdr.hdr_len; err = btf_parse_str_sec(env); if (err) goto errout; err = btf_check_all_metas(env); if (err) goto errout; err = btf_check_type_tags(env, btf, 1); if (err) goto errout; refcount_set(&btf->refcnt, 1); return btf; errout: if (btf) { kvfree(btf->types); kfree(btf); } return ERR_PTR(err); } struct btf *btf_parse_vmlinux(void) { struct btf_verifier_env *env = NULL; struct bpf_verifier_log *log; struct btf *btf; int err; env = kzalloc(sizeof(*env), GFP_KERNEL | __GFP_NOWARN); if (!env) return ERR_PTR(-ENOMEM); log = &env->log; log->level = BPF_LOG_KERNEL; btf = btf_parse_base(env, "vmlinux", __start_BTF, __stop_BTF - __start_BTF); if (IS_ERR(btf)) goto err_out; /* btf_parse_vmlinux() runs under bpf_verifier_lock */ bpf_ctx_convert.t = btf_type_by_id(btf, bpf_ctx_convert_btf_id[0]); err = btf_alloc_id(btf); if (err) { btf_free(btf); btf = ERR_PTR(err); } err_out: btf_verifier_env_free(env); return btf; } /* If .BTF_ids section was created with distilled base BTF, both base and * split BTF ids will need to be mapped to actual base/split ids for * BTF now that it has been relocated. */ static __u32 btf_relocate_id(const struct btf *btf, __u32 id) { if (!btf->base_btf || !btf->base_id_map) return id; return btf->base_id_map[id]; } #ifdef CONFIG_DEBUG_INFO_BTF_MODULES static struct btf *btf_parse_module(const char *module_name, const void *data, unsigned int data_size, void *base_data, unsigned int base_data_size) { struct btf *btf = NULL, *vmlinux_btf, *base_btf = NULL; struct btf_verifier_env *env = NULL; struct bpf_verifier_log *log; int err = 0; vmlinux_btf = bpf_get_btf_vmlinux(); if (IS_ERR(vmlinux_btf)) return vmlinux_btf; if (!vmlinux_btf) return ERR_PTR(-EINVAL); env = kzalloc(sizeof(*env), GFP_KERNEL | __GFP_NOWARN); if (!env) return ERR_PTR(-ENOMEM); log = &env->log; log->level = BPF_LOG_KERNEL; if (base_data) { base_btf = btf_parse_base(env, ".BTF.base", base_data, base_data_size); if (IS_ERR(base_btf)) { err = PTR_ERR(base_btf); goto errout; } } else { base_btf = vmlinux_btf; } btf = kzalloc(sizeof(*btf), GFP_KERNEL | __GFP_NOWARN); if (!btf) { err = -ENOMEM; goto errout; } env->btf = btf; btf->base_btf = base_btf; btf->start_id = base_btf->nr_types; btf->start_str_off = base_btf->hdr.str_len; btf->kernel_btf = true; snprintf(btf->name, sizeof(btf->name), "%s", module_name); btf->data = kvmemdup(data, data_size, GFP_KERNEL | __GFP_NOWARN); if (!btf->data) { err = -ENOMEM; goto errout; } btf->data_size = data_size; err = btf_parse_hdr(env); if (err) goto errout; btf->nohdr_data = btf->data + btf->hdr.hdr_len; err = btf_parse_str_sec(env); if (err) goto errout; err = btf_check_all_metas(env); if (err) goto errout; err = btf_check_type_tags(env, btf, btf_nr_types(base_btf)); if (err) goto errout; if (base_btf != vmlinux_btf) { err = btf_relocate(btf, vmlinux_btf, &btf->base_id_map); if (err) goto errout; btf_free(base_btf); base_btf = vmlinux_btf; } btf_verifier_env_free(env); refcount_set(&btf->refcnt, 1); return btf; errout: btf_verifier_env_free(env); if (!IS_ERR(base_btf) && base_btf != vmlinux_btf) btf_free(base_btf); if (btf) { kvfree(btf->data); kvfree(btf->types); kfree(btf); } return ERR_PTR(err); } #endif /* CONFIG_DEBUG_INFO_BTF_MODULES */ struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog) { struct bpf_prog *tgt_prog = prog->aux->dst_prog; if (tgt_prog) return tgt_prog->aux->btf; else return prog->aux->attach_btf; } static bool is_void_or_int_ptr(struct btf *btf, const struct btf_type *t) { /* skip modifiers */ t = btf_type_skip_modifiers(btf, t->type, NULL); return btf_type_is_void(t) || btf_type_is_int(t); } u32 btf_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off) { const struct btf_param *args; const struct btf_type *t; u32 offset = 0, nr_args; int i; if (!func_proto) return off / 8; nr_args = btf_type_vlen(func_proto); args = (const struct btf_param *)(func_proto + 1); for (i = 0; i < nr_args; i++) { t = btf_type_skip_modifiers(btf, args[i].type, NULL); offset += btf_type_is_ptr(t) ? 8 : roundup(t->size, 8); if (off < offset) return i; } t = btf_type_skip_modifiers(btf, func_proto->type, NULL); offset += btf_type_is_ptr(t) ? 8 : roundup(t->size, 8); if (off < offset) return nr_args; return nr_args + 1; } static bool prog_args_trusted(const struct bpf_prog *prog) { enum bpf_attach_type atype = prog->expected_attach_type; switch (prog->type) { case BPF_PROG_TYPE_TRACING: return atype == BPF_TRACE_RAW_TP || atype == BPF_TRACE_ITER; case BPF_PROG_TYPE_LSM: return bpf_lsm_is_trusted(prog); case BPF_PROG_TYPE_STRUCT_OPS: return true; default: return false; } } int btf_ctx_arg_offset(const struct btf *btf, const struct btf_type *func_proto, u32 arg_no) { const struct btf_param *args; const struct btf_type *t; int off = 0, i; u32 sz; args = btf_params(func_proto); for (i = 0; i < arg_no; i++) { t = btf_type_by_id(btf, args[i].type); t = btf_resolve_size(btf, t, &sz); if (IS_ERR(t)) return PTR_ERR(t); off += roundup(sz, 8); } return off; } struct bpf_raw_tp_null_args { const char *func; u64 mask; }; static const struct bpf_raw_tp_null_args raw_tp_null_args[] = { /* sched */ { "sched_pi_setprio", 0x10 }, /* ... from sched_numa_pair_template event class */ { "sched_stick_numa", 0x100 }, { "sched_swap_numa", 0x100 }, /* afs */ { "afs_make_fs_call", 0x10 }, { "afs_make_fs_calli", 0x10 }, { "afs_make_fs_call1", 0x10 }, { "afs_make_fs_call2", 0x10 }, { "afs_protocol_error", 0x1 }, { "afs_flock_ev", 0x10 }, /* cachefiles */ { "cachefiles_lookup", 0x1 | 0x200 }, { "cachefiles_unlink", 0x1 }, { "cachefiles_rename", 0x1 }, { "cachefiles_prep_read", 0x1 }, { "cachefiles_mark_active", 0x1 }, { "cachefiles_mark_failed", 0x1 }, { "cachefiles_mark_inactive", 0x1 }, { "cachefiles_vfs_error", 0x1 }, { "cachefiles_io_error", 0x1 }, { "cachefiles_ondemand_open", 0x1 }, { "cachefiles_ondemand_copen", 0x1 }, { "cachefiles_ondemand_close", 0x1 }, { "cachefiles_ondemand_read", 0x1 }, { "cachefiles_ondemand_cread", 0x1 }, { "cachefiles_ondemand_fd_write", 0x1 }, { "cachefiles_ondemand_fd_release", 0x1 }, /* ext4, from ext4__mballoc event class */ { "ext4_mballoc_discard", 0x10 }, { "ext4_mballoc_free", 0x10 }, /* fib */ { "fib_table_lookup", 0x100 }, /* filelock */ /* ... from filelock_lock event class */ { "posix_lock_inode", 0x10 }, { "fcntl_setlk", 0x10 }, { "locks_remove_posix", 0x10 }, { "flock_lock_inode", 0x10 }, /* ... from filelock_lease event class */ { "break_lease_noblock", 0x10 }, { "break_lease_block", 0x10 }, { "break_lease_unblock", 0x10 }, { "generic_delete_lease", 0x10 }, { "time_out_leases", 0x10 }, /* host1x */ { "host1x_cdma_push_gather", 0x10000 }, /* huge_memory */ { "mm_khugepaged_scan_pmd", 0x10 }, { "mm_collapse_huge_page_isolate", 0x1 }, { "mm_khugepaged_scan_file", 0x10 }, { "mm_khugepaged_collapse_file", 0x10 }, /* kmem */ { "mm_page_alloc", 0x1 }, { "mm_page_pcpu_drain", 0x1 }, /* .. from mm_page event class */ { "mm_page_alloc_zone_locked", 0x1 }, /* netfs */ { "netfs_failure", 0x10 }, /* power */ { "device_pm_callback_start", 0x10 }, /* qdisc */ { "qdisc_dequeue", 0x1000 }, /* rxrpc */ { "rxrpc_recvdata", 0x1 }, { "rxrpc_resend", 0x10 }, { "rxrpc_tq", 0x10 }, { "rxrpc_client", 0x1 }, /* skb */ {"kfree_skb", 0x1000}, /* sunrpc */ { "xs_stream_read_data", 0x1 }, /* ... from xprt_cong_event event class */ { "xprt_reserve_cong", 0x10 }, { "xprt_release_cong", 0x10 }, { "xprt_get_cong", 0x10 }, { "xprt_put_cong", 0x10 }, /* tcp */ { "tcp_send_reset", 0x11 }, { "tcp_sendmsg_locked", 0x100 }, /* tegra_apb_dma */ { "tegra_dma_tx_status", 0x100 }, /* timer_migration */ { "tmigr_update_events", 0x1 }, /* writeback, from writeback_folio_template event class */ { "writeback_dirty_folio", 0x10 }, { "folio_wait_writeback", 0x10 }, /* rdma */ { "mr_integ_alloc", 0x2000 }, /* bpf_testmod */ { "bpf_testmod_test_read", 0x0 }, /* amdgpu */ { "amdgpu_vm_bo_map", 0x1 }, { "amdgpu_vm_bo_unmap", 0x1 }, /* netfs */ { "netfs_folioq", 0x1 }, /* xfs from xfs_defer_pending_class */ { "xfs_defer_create_intent", 0x1 }, { "xfs_defer_cancel_list", 0x1 }, { "xfs_defer_pending_finish", 0x1 }, { "xfs_defer_pending_abort", 0x1 }, { "xfs_defer_relog_intent", 0x1 }, { "xfs_defer_isolate_paused", 0x1 }, { "xfs_defer_item_pause", 0x1 }, { "xfs_defer_item_unpause", 0x1 }, /* xfs from xfs_defer_pending_item_class */ { "xfs_defer_add_item", 0x1 }, { "xfs_defer_cancel_item", 0x1 }, { "xfs_defer_finish_item", 0x1 }, /* xfs from xfs_icwalk_class */ { "xfs_ioc_free_eofblocks", 0x10 }, { "xfs_blockgc_free_space", 0x10 }, /* xfs from xfs_btree_cur_class */ { "xfs_btree_updkeys", 0x100 }, { "xfs_btree_overlapped_query_range", 0x100 }, /* xfs from xfs_imap_class*/ { "xfs_map_blocks_found", 0x10000 }, { "xfs_map_blocks_alloc", 0x10000 }, { "xfs_iomap_alloc", 0x1000 }, { "xfs_iomap_found", 0x1000 }, /* xfs from xfs_fs_class */ { "xfs_inodegc_flush", 0x1 }, { "xfs_inodegc_push", 0x1 }, { "xfs_inodegc_start", 0x1 }, { "xfs_inodegc_stop", 0x1 }, { "xfs_inodegc_queue", 0x1 }, { "xfs_inodegc_throttle", 0x1 }, { "xfs_fs_sync_fs", 0x1 }, { "xfs_blockgc_start", 0x1 }, { "xfs_blockgc_stop", 0x1 }, { "xfs_blockgc_worker", 0x1 }, { "xfs_blockgc_flush_all", 0x1 }, /* xfs_scrub */ { "xchk_nlinks_live_update", 0x10 }, /* xfs_scrub from xchk_metapath_class */ { "xchk_metapath_lookup", 0x100 }, /* nfsd */ { "nfsd_dirent", 0x1 }, { "nfsd_file_acquire", 0x1001 }, { "nfsd_file_insert_err", 0x1 }, { "nfsd_file_cons_err", 0x1 }, /* nfs4 */ { "nfs4_setup_sequence", 0x1 }, { "pnfs_update_layout", 0x10000 }, { "nfs4_inode_callback_event", 0x200 }, { "nfs4_inode_stateid_callback_event", 0x200 }, /* nfs from pnfs_layout_event */ { "pnfs_mds_fallback_pg_init_read", 0x10000 }, { "pnfs_mds_fallback_pg_init_write", 0x10000 }, { "pnfs_mds_fallback_pg_get_mirror_count", 0x10000 }, { "pnfs_mds_fallback_read_done", 0x10000 }, { "pnfs_mds_fallback_write_done", 0x10000 }, { "pnfs_mds_fallback_read_pagelist", 0x10000 }, { "pnfs_mds_fallback_write_pagelist", 0x10000 }, /* coda */ { "coda_dec_pic_run", 0x10 }, { "coda_dec_pic_done", 0x10 }, /* cfg80211 */ { "cfg80211_scan_done", 0x11 }, { "rdev_set_coalesce", 0x10 }, { "cfg80211_report_wowlan_wakeup", 0x100 }, { "cfg80211_inform_bss_frame", 0x100 }, { "cfg80211_michael_mic_failure", 0x10000 }, /* cfg80211 from wiphy_work_event */ { "wiphy_work_queue", 0x10 }, { "wiphy_work_run", 0x10 }, { "wiphy_work_cancel", 0x10 }, { "wiphy_work_flush", 0x10 }, /* hugetlbfs */ { "hugetlbfs_alloc_inode", 0x10 }, /* spufs */ { "spufs_context", 0x10 }, /* kvm_hv */ { "kvm_page_fault_enter", 0x100 }, /* dpu */ { "dpu_crtc_setup_mixer", 0x100 }, /* binder */ { "binder_transaction", 0x100 }, /* bcachefs */ { "btree_path_free", 0x100 }, /* hfi1_tx */ { "hfi1_sdma_progress", 0x1000 }, /* iptfs */ { "iptfs_ingress_postq_event", 0x1000 }, /* neigh */ { "neigh_update", 0x10 }, /* snd_firewire_lib */ { "amdtp_packet", 0x100 }, }; bool btf_ctx_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, struct bpf_insn_access_aux *info) { const struct btf_type *t = prog->aux->attach_func_proto; struct bpf_prog *tgt_prog = prog->aux->dst_prog; struct btf *btf = bpf_prog_get_target_btf(prog); const char *tname = prog->aux->attach_func_name; struct bpf_verifier_log *log = info->log; const struct btf_param *args; bool ptr_err_raw_tp = false; const char *tag_value; u32 nr_args, arg; int i, ret; if (off % 8) { bpf_log(log, "func '%s' offset %d is not multiple of 8\n", tname, off); return false; } arg = btf_ctx_arg_idx(btf, t, off); args = (const struct btf_param *)(t + 1); /* if (t == NULL) Fall back to default BPF prog with * MAX_BPF_FUNC_REG_ARGS u64 arguments. */ nr_args = t ? btf_type_vlen(t) : MAX_BPF_FUNC_REG_ARGS; if (prog->aux->attach_btf_trace) { /* skip first 'void *__data' argument in btf_trace_##name typedef */ args++; nr_args--; } if (arg > nr_args) { bpf_log(log, "func '%s' doesn't have %d-th argument\n", tname, arg + 1); return false; } if (arg == nr_args) { switch (prog->expected_attach_type) { case BPF_LSM_MAC: /* mark we are accessing the return value */ info->is_retval = true; fallthrough; case BPF_LSM_CGROUP: case BPF_TRACE_FEXIT: /* When LSM programs are attached to void LSM hooks * they use FEXIT trampolines and when attached to * int LSM hooks, they use MODIFY_RETURN trampolines. * * While the LSM programs are BPF_MODIFY_RETURN-like * the check: * * if (ret_type != 'int') * return -EINVAL; * * is _not_ done here. This is still safe as LSM hooks * have only void and int return types. */ if (!t) return true; t = btf_type_by_id(btf, t->type); break; case BPF_MODIFY_RETURN: /* For now the BPF_MODIFY_RETURN can only be attached to * functions that return an int. */ if (!t) return false; t = btf_type_skip_modifiers(btf, t->type, NULL); if (!btf_type_is_small_int(t)) { bpf_log(log, "ret type %s not allowed for fmod_ret\n", btf_type_str(t)); return false; } break; default: bpf_log(log, "func '%s' doesn't have %d-th argument\n", tname, arg + 1); return false; } } else { if (!t) /* Default prog with MAX_BPF_FUNC_REG_ARGS args */ return true; t = btf_type_by_id(btf, args[arg].type); } /* skip modifiers */ while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); if (btf_type_is_small_int(t) || btf_is_any_enum(t) || btf_type_is_struct(t)) /* accessing a scalar */ return true; if (!btf_type_is_ptr(t)) { bpf_log(log, "func '%s' arg%d '%s' has type %s. Only pointer access is allowed\n", tname, arg, __btf_name_by_offset(btf, t->name_off), btf_type_str(t)); return false; } if (size != sizeof(u64)) { bpf_log(log, "func '%s' size %d must be 8\n", tname, size); return false; } /* check for PTR_TO_RDONLY_BUF_OR_NULL or PTR_TO_RDWR_BUF_OR_NULL */ for (i = 0; i < prog->aux->ctx_arg_info_size; i++) { const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i]; u32 type, flag; type = base_type(ctx_arg_info->reg_type); flag = type_flag(ctx_arg_info->reg_type); if (ctx_arg_info->offset == off && type == PTR_TO_BUF && (flag & PTR_MAYBE_NULL)) { info->reg_type = ctx_arg_info->reg_type; return true; } } /* * If it's a pointer to void, it's the same as scalar from the verifier * safety POV. Either way, no futher pointer walking is allowed. */ if (is_void_or_int_ptr(btf, t)) return true; /* this is a pointer to another type */ for (i = 0; i < prog->aux->ctx_arg_info_size; i++) { const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i]; if (ctx_arg_info->offset == off) { if (!ctx_arg_info->btf_id) { bpf_log(log,"invalid btf_id for context argument offset %u\n", off); return false; } info->reg_type = ctx_arg_info->reg_type; info->btf = ctx_arg_info->btf ? : btf_vmlinux; info->btf_id = ctx_arg_info->btf_id; info->ref_obj_id = ctx_arg_info->ref_obj_id; return true; } } info->reg_type = PTR_TO_BTF_ID; if (prog_args_trusted(prog)) info->reg_type |= PTR_TRUSTED; if (btf_param_match_suffix(btf, &args[arg], "__nullable")) info->reg_type |= PTR_MAYBE_NULL; if (prog->expected_attach_type == BPF_TRACE_RAW_TP) { struct btf *btf = prog->aux->attach_btf; const struct btf_type *t; const char *tname; /* BTF lookups cannot fail, return false on error */ t = btf_type_by_id(btf, prog->aux->attach_btf_id); if (!t) return false; tname = btf_name_by_offset(btf, t->name_off); if (!tname) return false; /* Checked by bpf_check_attach_target */ tname += sizeof("btf_trace_") - 1; for (i = 0; i < ARRAY_SIZE(raw_tp_null_args); i++) { /* Is this a func with potential NULL args? */ if (strcmp(tname, raw_tp_null_args[i].func)) continue; if (raw_tp_null_args[i].mask & (0x1ULL << (arg * 4))) info->reg_type |= PTR_MAYBE_NULL; /* Is the current arg IS_ERR? */ if (raw_tp_null_args[i].mask & (0x2ULL << (arg * 4))) ptr_err_raw_tp = true; break; } /* If we don't know NULL-ness specification and the tracepoint * is coming from a loadable module, be conservative and mark * argument as PTR_MAYBE_NULL. */ if (i == ARRAY_SIZE(raw_tp_null_args) && btf_is_module(btf)) info->reg_type |= PTR_MAYBE_NULL; } if (tgt_prog) { enum bpf_prog_type tgt_type; if (tgt_prog->type == BPF_PROG_TYPE_EXT) tgt_type = tgt_prog->aux->saved_dst_prog_type; else tgt_type = tgt_prog->type; ret = btf_translate_to_vmlinux(log, btf, t, tgt_type, arg); if (ret > 0) { info->btf = btf_vmlinux; info->btf_id = ret; return true; } else { return false; } } info->btf = btf; info->btf_id = t->type; t = btf_type_by_id(btf, t->type); if (btf_type_is_type_tag(t) && !btf_type_kflag(t)) { tag_value = __btf_name_by_offset(btf, t->name_off); if (strcmp(tag_value, "user") == 0) info->reg_type |= MEM_USER; if (strcmp(tag_value, "percpu") == 0) info->reg_type |= MEM_PERCPU; } /* skip modifiers */ while (btf_type_is_modifier(t)) { info->btf_id = t->type; t = btf_type_by_id(btf, t->type); } if (!btf_type_is_struct(t)) { bpf_log(log, "func '%s' arg%d type %s is not a struct\n", tname, arg, btf_type_str(t)); return false; } bpf_log(log, "func '%s' arg%d has btf_id %d type %s '%s'\n", tname, arg, info->btf_id, btf_type_str(t), __btf_name_by_offset(btf, t->name_off)); /* Perform all checks on the validity of type for this argument, but if * we know it can be IS_ERR at runtime, scrub pointer type and mark as * scalar. */ if (ptr_err_raw_tp) { bpf_log(log, "marking pointer arg%d as scalar as it may encode error", arg); info->reg_type = SCALAR_VALUE; } return true; } EXPORT_SYMBOL_GPL(btf_ctx_access); enum bpf_struct_walk_result { /* < 0 error */ WALK_SCALAR = 0, WALK_PTR, WALK_PTR_UNTRUSTED, WALK_STRUCT, }; static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf, const struct btf_type *t, int off, int size, u32 *next_btf_id, enum bpf_type_flag *flag, const char **field_name) { u32 i, moff, mtrue_end, msize = 0, total_nelems = 0; const struct btf_type *mtype, *elem_type = NULL; const struct btf_member *member; const char *tname, *mname, *tag_value; u32 vlen, elem_id, mid; again: if (btf_type_is_modifier(t)) t = btf_type_skip_modifiers(btf, t->type, NULL); tname = __btf_name_by_offset(btf, t->name_off); if (!btf_type_is_struct(t)) { bpf_log(log, "Type '%s' is not a struct\n", tname); return -EINVAL; } vlen = btf_type_vlen(t); if (BTF_INFO_KIND(t->info) == BTF_KIND_UNION && vlen != 1 && !(*flag & PTR_UNTRUSTED)) /* * walking unions yields untrusted pointers * with exception of __bpf_md_ptr and other * unions with a single member */ *flag |= PTR_UNTRUSTED; if (off + size > t->size) { /* If the last element is a variable size array, we may * need to relax the rule. */ struct btf_array *array_elem; if (vlen == 0) goto error; member = btf_type_member(t) + vlen - 1; mtype = btf_type_skip_modifiers(btf, member->type, NULL); if (!btf_type_is_array(mtype)) goto error; array_elem = (struct btf_array *)(mtype + 1); if (array_elem->nelems != 0) goto error; moff = __btf_member_bit_offset(t, member) / 8; if (off < moff) goto error; /* allow structure and integer */ t = btf_type_skip_modifiers(btf, array_elem->type, NULL); if (btf_type_is_int(t)) return WALK_SCALAR; if (!btf_type_is_struct(t)) goto error; off = (off - moff) % t->size; goto again; error: bpf_log(log, "access beyond struct %s at off %u size %u\n", tname, off, size); return -EACCES; } for_each_member(i, t, member) { /* offset of the field in bytes */ moff = __btf_member_bit_offset(t, member) / 8; if (off + size <= moff) /* won't find anything, field is already too far */ break; if (__btf_member_bitfield_size(t, member)) { u32 end_bit = __btf_member_bit_offset(t, member) + __btf_member_bitfield_size(t, member); /* off <= moff instead of off == moff because clang * does not generate a BTF member for anonymous * bitfield like the ":16" here: * struct { * int :16; * int x:8; * }; */ if (off <= moff && BITS_ROUNDUP_BYTES(end_bit) <= off + size) return WALK_SCALAR; /* off may be accessing a following member * * or * * Doing partial access at either end of this * bitfield. Continue on this case also to * treat it as not accessing this bitfield * and eventually error out as field not * found to keep it simple. * It could be relaxed if there was a legit * partial access case later. */ continue; } /* In case of "off" is pointing to holes of a struct */ if (off < moff) break; /* type of the field */ mid = member->type; mtype = btf_type_by_id(btf, member->type); mname = __btf_name_by_offset(btf, member->name_off); mtype = __btf_resolve_size(btf, mtype, &msize, &elem_type, &elem_id, &total_nelems, &mid); if (IS_ERR(mtype)) { bpf_log(log, "field %s doesn't have size\n", mname); return -EFAULT; } mtrue_end = moff + msize; if (off >= mtrue_end) /* no overlap with member, keep iterating */ continue; if (btf_type_is_array(mtype)) { u32 elem_idx; /* __btf_resolve_size() above helps to * linearize a multi-dimensional array. * * The logic here is treating an array * in a struct as the following way: * * struct outer { * struct inner array[2][2]; * }; * * looks like: * * struct outer { * struct inner array_elem0; * struct inner array_elem1; * struct inner array_elem2; * struct inner array_elem3; * }; * * When accessing outer->array[1][0], it moves * moff to "array_elem2", set mtype to * "struct inner", and msize also becomes * sizeof(struct inner). Then most of the * remaining logic will fall through without * caring the current member is an array or * not. * * Unlike mtype/msize/moff, mtrue_end does not * change. The naming difference ("_true") tells * that it is not always corresponding to * the current mtype/msize/moff. * It is the true end of the current * member (i.e. array in this case). That * will allow an int array to be accessed like * a scratch space, * i.e. allow access beyond the size of * the array's element as long as it is * within the mtrue_end boundary. */ /* skip empty array */ if (moff == mtrue_end) continue; msize /= total_nelems; elem_idx = (off - moff) / msize; moff += elem_idx * msize; mtype = elem_type; mid = elem_id; } /* the 'off' we're looking for is either equal to start * of this field or inside of this struct */ if (btf_type_is_struct(mtype)) { /* our field must be inside that union or struct */ t = mtype; /* return if the offset matches the member offset */ if (off == moff) { *next_btf_id = mid; return WALK_STRUCT; } /* adjust offset we're looking for */ off -= moff; goto again; } if (btf_type_is_ptr(mtype)) { const struct btf_type *stype, *t; enum bpf_type_flag tmp_flag = 0; u32 id; if (msize != size || off != moff) { bpf_log(log, "cannot access ptr member %s with moff %u in struct %s with off %u size %u\n", mname, moff, tname, off, size); return -EACCES; } /* check type tag */ t = btf_type_by_id(btf, mtype->type); if (btf_type_is_type_tag(t) && !btf_type_kflag(t)) { tag_value = __btf_name_by_offset(btf, t->name_off); /* check __user tag */ if (strcmp(tag_value, "user") == 0) tmp_flag = MEM_USER; /* check __percpu tag */ if (strcmp(tag_value, "percpu") == 0) tmp_flag = MEM_PERCPU; /* check __rcu tag */ if (strcmp(tag_value, "rcu") == 0) tmp_flag = MEM_RCU; } stype = btf_type_skip_modifiers(btf, mtype->type, &id); if (btf_type_is_struct(stype)) { *next_btf_id = id; *flag |= tmp_flag; if (field_name) *field_name = mname; return WALK_PTR; } return WALK_PTR_UNTRUSTED; } /* Allow more flexible access within an int as long as * it is within mtrue_end. * Since mtrue_end could be the end of an array, * that also allows using an array of int as a scratch * space. e.g. skb->cb[]. */ if (off + size > mtrue_end && !(*flag & PTR_UNTRUSTED)) { bpf_log(log, "access beyond the end of member %s (mend:%u) in struct %s with off %u size %u\n", mname, mtrue_end, tname, off, size); return -EACCES; } return WALK_SCALAR; } bpf_log(log, "struct %s doesn't have field at offset %d\n", tname, off); return -EINVAL; } int btf_struct_access(struct bpf_verifier_log *log, const struct bpf_reg_state *reg, int off, int size, enum bpf_access_type atype __maybe_unused, u32 *next_btf_id, enum bpf_type_flag *flag, const char **field_name) { const struct btf *btf = reg->btf; enum bpf_type_flag tmp_flag = 0; const struct btf_type *t; u32 id = reg->btf_id; int err; while (type_is_alloc(reg->type)) { struct btf_struct_meta *meta; struct btf_record *rec; int i; meta = btf_find_struct_meta(btf, id); if (!meta) break; rec = meta->record; for (i = 0; i < rec->cnt; i++) { struct btf_field *field = &rec->fields[i]; u32 offset = field->offset; if (off < offset + field->size && offset < off + size) { bpf_log(log, "direct access to %s is disallowed\n", btf_field_type_name(field->type)); return -EACCES; } } break; } t = btf_type_by_id(btf, id); do { err = btf_struct_walk(log, btf, t, off, size, &id, &tmp_flag, field_name); switch (err) { case WALK_PTR: /* For local types, the destination register cannot * become a pointer again. */ if (type_is_alloc(reg->type)) return SCALAR_VALUE; /* If we found the pointer or scalar on t+off, * we're done. */ *next_btf_id = id; *flag = tmp_flag; return PTR_TO_BTF_ID; case WALK_PTR_UNTRUSTED: *flag = MEM_RDONLY | PTR_UNTRUSTED; return PTR_TO_MEM; case WALK_SCALAR: return SCALAR_VALUE; case WALK_STRUCT: /* We found nested struct, so continue the search * by diving in it. At this point the offset is * aligned with the new type, so set it to 0. */ t = btf_type_by_id(btf, id); off = 0; break; default: /* It's either error or unknown return value.. * scream and leave. */ if (WARN_ONCE(err > 0, "unknown btf_struct_walk return value")) return -EINVAL; return err; } } while (t); return -EINVAL; } /* Check that two BTF types, each specified as an BTF object + id, are exactly * the same. Trivial ID check is not enough due to module BTFs, because we can * end up with two different module BTFs, but IDs point to the common type in * vmlinux BTF. */ bool btf_types_are_same(const struct btf *btf1, u32 id1, const struct btf *btf2, u32 id2) { if (id1 != id2) return false; if (btf1 == btf2) return true; return btf_type_by_id(btf1, id1) == btf_type_by_id(btf2, id2); } bool btf_struct_ids_match(struct bpf_verifier_log *log, const struct btf *btf, u32 id, int off, const struct btf *need_btf, u32 need_type_id, bool strict) { const struct btf_type *type; enum bpf_type_flag flag = 0; int err; /* Are we already done? */ if (off == 0 && btf_types_are_same(btf, id, need_btf, need_type_id)) return true; /* In case of strict type match, we do not walk struct, the top level * type match must succeed. When strict is true, off should have already * been 0. */ if (strict) return false; again: type = btf_type_by_id(btf, id); if (!type) return false; err = btf_struct_walk(log, btf, type, off, 1, &id, &flag, NULL); if (err != WALK_STRUCT) return false; /* We found nested struct object. If it matches * the requested ID, we're done. Otherwise let's * continue the search with offset 0 in the new * type. */ if (!btf_types_are_same(btf, id, need_btf, need_type_id)) { off = 0; goto again; } return true; } static int __get_type_size(struct btf *btf, u32 btf_id, const struct btf_type **ret_type) { const struct btf_type *t; *ret_type = btf_type_by_id(btf, 0); if (!btf_id) /* void */ return 0; t = btf_type_by_id(btf, btf_id); while (t && btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); if (!t) return -EINVAL; *ret_type = t; if (btf_type_is_ptr(t)) /* kernel size of pointer. Not BPF's size of pointer*/ return sizeof(void *); if (btf_type_is_int(t) || btf_is_any_enum(t) || btf_type_is_struct(t)) return t->size; return -EINVAL; } static u8 __get_type_fmodel_flags(const struct btf_type *t) { u8 flags = 0; if (btf_type_is_struct(t)) flags |= BTF_FMODEL_STRUCT_ARG; if (btf_type_is_signed_int(t)) flags |= BTF_FMODEL_SIGNED_ARG; return flags; } int btf_distill_func_proto(struct bpf_verifier_log *log, struct btf *btf, const struct btf_type *func, const char *tname, struct btf_func_model *m) { const struct btf_param *args; const struct btf_type *t; u32 i, nargs; int ret; if (!func) { /* BTF function prototype doesn't match the verifier types. * Fall back to MAX_BPF_FUNC_REG_ARGS u64 args. */ for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) { m->arg_size[i] = 8; m->arg_flags[i] = 0; } m->ret_size = 8; m->ret_flags = 0; m->nr_args = MAX_BPF_FUNC_REG_ARGS; return 0; } args = (const struct btf_param *)(func + 1); nargs = btf_type_vlen(func); if (nargs > MAX_BPF_FUNC_ARGS) { bpf_log(log, "The function %s has %d arguments. Too many.\n", tname, nargs); return -EINVAL; } ret = __get_type_size(btf, func->type, &t); if (ret < 0 || btf_type_is_struct(t)) { bpf_log(log, "The function %s return type %s is unsupported.\n", tname, btf_type_str(t)); return -EINVAL; } m->ret_size = ret; m->ret_flags = __get_type_fmodel_flags(t); for (i = 0; i < nargs; i++) { if (i == nargs - 1 && args[i].type == 0) { bpf_log(log, "The function %s with variable args is unsupported.\n", tname); return -EINVAL; } ret = __get_type_size(btf, args[i].type, &t); /* No support of struct argument size greater than 16 bytes */ if (ret < 0 || ret > 16) { bpf_log(log, "The function %s arg%d type %s is unsupported.\n", tname, i, btf_type_str(t)); return -EINVAL; } if (ret == 0) { bpf_log(log, "The function %s has malformed void argument.\n", tname); return -EINVAL; } m->arg_size[i] = ret; m->arg_flags[i] = __get_type_fmodel_flags(t); } m->nr_args = nargs; return 0; } /* Compare BTFs of two functions assuming only scalars and pointers to context. * t1 points to BTF_KIND_FUNC in btf1 * t2 points to BTF_KIND_FUNC in btf2 * Returns: * EINVAL - function prototype mismatch * EFAULT - verifier bug * 0 - 99% match. The last 1% is validated by the verifier. */ static int btf_check_func_type_match(struct bpf_verifier_log *log, struct btf *btf1, const struct btf_type *t1, struct btf *btf2, const struct btf_type *t2) { const struct btf_param *args1, *args2; const char *fn1, *fn2, *s1, *s2; u32 nargs1, nargs2, i; fn1 = btf_name_by_offset(btf1, t1->name_off); fn2 = btf_name_by_offset(btf2, t2->name_off); if (btf_func_linkage(t1) != BTF_FUNC_GLOBAL) { bpf_log(log, "%s() is not a global function\n", fn1); return -EINVAL; } if (btf_func_linkage(t2) != BTF_FUNC_GLOBAL) { bpf_log(log, "%s() is not a global function\n", fn2); return -EINVAL; } t1 = btf_type_by_id(btf1, t1->type); if (!t1 || !btf_type_is_func_proto(t1)) return -EFAULT; t2 = btf_type_by_id(btf2, t2->type); if (!t2 || !btf_type_is_func_proto(t2)) return -EFAULT; args1 = (const struct btf_param *)(t1 + 1); nargs1 = btf_type_vlen(t1); args2 = (const struct btf_param *)(t2 + 1); nargs2 = btf_type_vlen(t2); if (nargs1 != nargs2) { bpf_log(log, "%s() has %d args while %s() has %d args\n", fn1, nargs1, fn2, nargs2); return -EINVAL; } t1 = btf_type_skip_modifiers(btf1, t1->type, NULL); t2 = btf_type_skip_modifiers(btf2, t2->type, NULL); if (t1->info != t2->info) { bpf_log(log, "Return type %s of %s() doesn't match type %s of %s()\n", btf_type_str(t1), fn1, btf_type_str(t2), fn2); return -EINVAL; } for (i = 0; i < nargs1; i++) { t1 = btf_type_skip_modifiers(btf1, args1[i].type, NULL); t2 = btf_type_skip_modifiers(btf2, args2[i].type, NULL); if (t1->info != t2->info) { bpf_log(log, "arg%d in %s() is %s while %s() has %s\n", i, fn1, btf_type_str(t1), fn2, btf_type_str(t2)); return -EINVAL; } if (btf_type_has_size(t1) && t1->size != t2->size) { bpf_log(log, "arg%d in %s() has size %d while %s() has %d\n", i, fn1, t1->size, fn2, t2->size); return -EINVAL; } /* global functions are validated with scalars and pointers * to context only. And only global functions can be replaced. * Hence type check only those types. */ if (btf_type_is_int(t1) || btf_is_any_enum(t1)) continue; if (!btf_type_is_ptr(t1)) { bpf_log(log, "arg%d in %s() has unrecognized type\n", i, fn1); return -EINVAL; } t1 = btf_type_skip_modifiers(btf1, t1->type, NULL); t2 = btf_type_skip_modifiers(btf2, t2->type, NULL); if (!btf_type_is_struct(t1)) { bpf_log(log, "arg%d in %s() is not a pointer to context\n", i, fn1); return -EINVAL; } if (!btf_type_is_struct(t2)) { bpf_log(log, "arg%d in %s() is not a pointer to context\n", i, fn2); return -EINVAL; } /* This is an optional check to make program writing easier. * Compare names of structs and report an error to the user. * btf_prepare_func_args() already checked that t2 struct * is a context type. btf_prepare_func_args() will check * later that t1 struct is a context type as well. */ s1 = btf_name_by_offset(btf1, t1->name_off); s2 = btf_name_by_offset(btf2, t2->name_off); if (strcmp(s1, s2)) { bpf_log(log, "arg%d %s(struct %s *) doesn't match %s(struct %s *)\n", i, fn1, s1, fn2, s2); return -EINVAL; } } return 0; } /* Compare BTFs of given program with BTF of target program */ int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *prog, struct btf *btf2, const struct btf_type *t2) { struct btf *btf1 = prog->aux->btf; const struct btf_type *t1; u32 btf_id = 0; if (!prog->aux->func_info) { bpf_log(log, "Program extension requires BTF\n"); return -EINVAL; } btf_id = prog->aux->func_info[0].type_id; if (!btf_id) return -EFAULT; t1 = btf_type_by_id(btf1, btf_id); if (!t1 || !btf_type_is_func(t1)) return -EFAULT; return btf_check_func_type_match(log, btf1, t1, btf2, t2); } static bool btf_is_dynptr_ptr(const struct btf *btf, const struct btf_type *t) { const char *name; t = btf_type_by_id(btf, t->type); /* skip PTR */ while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); /* allow either struct or struct forward declaration */ if (btf_type_is_struct(t) || (btf_type_is_fwd(t) && btf_type_kflag(t) == 0)) { name = btf_str_by_offset(btf, t->name_off); return name && strcmp(name, "bpf_dynptr") == 0; } return false; } struct bpf_cand_cache { const char *name; u32 name_len; u16 kind; u16 cnt; struct { const struct btf *btf; u32 id; } cands[]; }; static DEFINE_MUTEX(cand_cache_mutex); static struct bpf_cand_cache * bpf_core_find_cands(struct bpf_core_ctx *ctx, u32 local_type_id); static int btf_get_ptr_to_btf_id(struct bpf_verifier_log *log, int arg_idx, const struct btf *btf, const struct btf_type *t) { struct bpf_cand_cache *cc; struct bpf_core_ctx ctx = { .btf = btf, .log = log, }; u32 kern_type_id, type_id; int err = 0; /* skip PTR and modifiers */ type_id = t->type; t = btf_type_by_id(btf, t->type); while (btf_type_is_modifier(t)) { type_id = t->type; t = btf_type_by_id(btf, t->type); } mutex_lock(&cand_cache_mutex); cc = bpf_core_find_cands(&ctx, type_id); if (IS_ERR(cc)) { err = PTR_ERR(cc); bpf_log(log, "arg#%d reference type('%s %s') candidate matching error: %d\n", arg_idx, btf_type_str(t), __btf_name_by_offset(btf, t->name_off), err); goto cand_cache_unlock; } if (cc->cnt != 1) { bpf_log(log, "arg#%d reference type('%s %s') %s\n", arg_idx, btf_type_str(t), __btf_name_by_offset(btf, t->name_off), cc->cnt == 0 ? "has no matches" : "is ambiguous"); err = cc->cnt == 0 ? -ENOENT : -ESRCH; goto cand_cache_unlock; } if (btf_is_module(cc->cands[0].btf)) { bpf_log(log, "arg#%d reference type('%s %s') points to kernel module type (unsupported)\n", arg_idx, btf_type_str(t), __btf_name_by_offset(btf, t->name_off)); err = -EOPNOTSUPP; goto cand_cache_unlock; } kern_type_id = cc->cands[0].id; cand_cache_unlock: mutex_unlock(&cand_cache_mutex); if (err) return err; return kern_type_id; } enum btf_arg_tag { ARG_TAG_CTX = BIT_ULL(0), ARG_TAG_NONNULL = BIT_ULL(1), ARG_TAG_TRUSTED = BIT_ULL(2), ARG_TAG_UNTRUSTED = BIT_ULL(3), ARG_TAG_NULLABLE = BIT_ULL(4), ARG_TAG_ARENA = BIT_ULL(5), }; /* Process BTF of a function to produce high-level expectation of function * arguments (like ARG_PTR_TO_CTX, or ARG_PTR_TO_MEM, etc). This information * is cached in subprog info for reuse. * Returns: * EFAULT - there is a verifier bug. Abort verification. * EINVAL - cannot convert BTF. * 0 - Successfully processed BTF and constructed argument expectations. */ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog) { bool is_global = subprog_aux(env, subprog)->linkage == BTF_FUNC_GLOBAL; struct bpf_subprog_info *sub = subprog_info(env, subprog); struct bpf_verifier_log *log = &env->log; struct bpf_prog *prog = env->prog; enum bpf_prog_type prog_type = prog->type; struct btf *btf = prog->aux->btf; const struct btf_param *args; const struct btf_type *t, *ref_t, *fn_t; u32 i, nargs, btf_id; const char *tname; if (sub->args_cached) return 0; if (!prog->aux->func_info) { verifier_bug(env, "func_info undefined"); return -EFAULT; } btf_id = prog->aux->func_info[subprog].type_id; if (!btf_id) { if (!is_global) /* not fatal for static funcs */ return -EINVAL; bpf_log(log, "Global functions need valid BTF\n"); return -EFAULT; } fn_t = btf_type_by_id(btf, btf_id); if (!fn_t || !btf_type_is_func(fn_t)) { /* These checks were already done by the verifier while loading * struct bpf_func_info */ bpf_log(log, "BTF of func#%d doesn't point to KIND_FUNC\n", subprog); return -EFAULT; } tname = btf_name_by_offset(btf, fn_t->name_off); if (prog->aux->func_info_aux[subprog].unreliable) { verifier_bug(env, "unreliable BTF for function %s()", tname); return -EFAULT; } if (prog_type == BPF_PROG_TYPE_EXT) prog_type = prog->aux->dst_prog->type; t = btf_type_by_id(btf, fn_t->type); if (!t || !btf_type_is_func_proto(t)) { bpf_log(log, "Invalid type of function %s()\n", tname); return -EFAULT; } args = (const struct btf_param *)(t + 1); nargs = btf_type_vlen(t); if (nargs > MAX_BPF_FUNC_REG_ARGS) { if (!is_global) return -EINVAL; bpf_log(log, "Global function %s() with %d > %d args. Buggy compiler.\n", tname, nargs, MAX_BPF_FUNC_REG_ARGS); return -EINVAL; } /* check that function returns int, exception cb also requires this */ t = btf_type_by_id(btf, t->type); while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); if (!btf_type_is_int(t) && !btf_is_any_enum(t)) { if (!is_global) return -EINVAL; bpf_log(log, "Global function %s() doesn't return scalar. Only those are supported.\n", tname); return -EINVAL; } /* Convert BTF function arguments into verifier types. * Only PTR_TO_CTX and SCALAR are supported atm. */ for (i = 0; i < nargs; i++) { u32 tags = 0; int id = 0; /* 'arg:<tag>' decl_tag takes precedence over derivation of * register type from BTF type itself */ while ((id = btf_find_next_decl_tag(btf, fn_t, i, "arg:", id)) > 0) { const struct btf_type *tag_t = btf_type_by_id(btf, id); const char *tag = __btf_name_by_offset(btf, tag_t->name_off) + 4; /* disallow arg tags in static subprogs */ if (!is_global) { bpf_log(log, "arg#%d type tag is not supported in static functions\n", i); return -EOPNOTSUPP; } if (strcmp(tag, "ctx") == 0) { tags |= ARG_TAG_CTX; } else if (strcmp(tag, "trusted") == 0) { tags |= ARG_TAG_TRUSTED; } else if (strcmp(tag, "untrusted") == 0) { tags |= ARG_TAG_UNTRUSTED; } else if (strcmp(tag, "nonnull") == 0) { tags |= ARG_TAG_NONNULL; } else if (strcmp(tag, "nullable") == 0) { tags |= ARG_TAG_NULLABLE; } else if (strcmp(tag, "arena") == 0) { tags |= ARG_TAG_ARENA; } else { bpf_log(log, "arg#%d has unsupported set of tags\n", i); return -EOPNOTSUPP; } } if (id != -ENOENT) { bpf_log(log, "arg#%d type tag fetching failure: %d\n", i, id); return id; } t = btf_type_by_id(btf, args[i].type); while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); if (!btf_type_is_ptr(t)) goto skip_pointer; if ((tags & ARG_TAG_CTX) || btf_is_prog_ctx_type(log, btf, t, prog_type, i)) { if (tags & ~ARG_TAG_CTX) { bpf_log(log, "arg#%d has invalid combination of tags\n", i); return -EINVAL; } if ((tags & ARG_TAG_CTX) && btf_validate_prog_ctx_type(log, btf, t, i, prog_type, prog->expected_attach_type)) return -EINVAL; sub->args[i].arg_type = ARG_PTR_TO_CTX; continue; } if (btf_is_dynptr_ptr(btf, t)) { if (tags) { bpf_log(log, "arg#%d has invalid combination of tags\n", i); return -EINVAL; } sub->args[i].arg_type = ARG_PTR_TO_DYNPTR | MEM_RDONLY; continue; } if (tags & ARG_TAG_TRUSTED) { int kern_type_id; if (tags & ARG_TAG_NONNULL) { bpf_log(log, "arg#%d has invalid combination of tags\n", i); return -EINVAL; } kern_type_id = btf_get_ptr_to_btf_id(log, i, btf, t); if (kern_type_id < 0) return kern_type_id; sub->args[i].arg_type = ARG_PTR_TO_BTF_ID | PTR_TRUSTED; if (tags & ARG_TAG_NULLABLE) sub->args[i].arg_type |= PTR_MAYBE_NULL; sub->args[i].btf_id = kern_type_id; continue; } if (tags & ARG_TAG_UNTRUSTED) { struct btf *vmlinux_btf; int kern_type_id; if (tags & ~ARG_TAG_UNTRUSTED) { bpf_log(log, "arg#%d untrusted cannot be combined with any other tags\n", i); return -EINVAL; } ref_t = btf_type_skip_modifiers(btf, t->type, NULL); if (btf_type_is_void(ref_t) || btf_type_is_primitive(ref_t)) { sub->args[i].arg_type = ARG_PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED; sub->args[i].mem_size = 0; continue; } kern_type_id = btf_get_ptr_to_btf_id(log, i, btf, t); if (kern_type_id < 0) return kern_type_id; vmlinux_btf = bpf_get_btf_vmlinux(); ref_t = btf_type_by_id(vmlinux_btf, kern_type_id); if (!btf_type_is_struct(ref_t)) { tname = __btf_name_by_offset(vmlinux_btf, t->name_off); bpf_log(log, "arg#%d has type %s '%s', but only struct or primitive types are allowed\n", i, btf_type_str(ref_t), tname); return -EINVAL; } sub->args[i].arg_type = ARG_PTR_TO_BTF_ID | PTR_UNTRUSTED; sub->args[i].btf_id = kern_type_id; continue; } if (tags & ARG_TAG_ARENA) { if (tags & ~ARG_TAG_ARENA) { bpf_log(log, "arg#%d arena cannot be combined with any other tags\n", i); return -EINVAL; } sub->args[i].arg_type = ARG_PTR_TO_ARENA; continue; } if (is_global) { /* generic user data pointer */ u32 mem_size; if (tags & ARG_TAG_NULLABLE) { bpf_log(log, "arg#%d has invalid combination of tags\n", i); return -EINVAL; } t = btf_type_skip_modifiers(btf, t->type, NULL); ref_t = btf_resolve_size(btf, t, &mem_size); if (IS_ERR(ref_t)) { bpf_log(log, "arg#%d reference type('%s %s') size cannot be determined: %ld\n", i, btf_type_str(t), btf_name_by_offset(btf, t->name_off), PTR_ERR(ref_t)); return -EINVAL; } sub->args[i].arg_type = ARG_PTR_TO_MEM | PTR_MAYBE_NULL; if (tags & ARG_TAG_NONNULL) sub->args[i].arg_type &= ~PTR_MAYBE_NULL; sub->args[i].mem_size = mem_size; continue; } skip_pointer: if (tags) { bpf_log(log, "arg#%d has pointer tag, but is not a pointer type\n", i); return -EINVAL; } if (btf_type_is_int(t) || btf_is_any_enum(t)) { sub->args[i].arg_type = ARG_ANYTHING; continue; } if (!is_global) return -EINVAL; bpf_log(log, "Arg#%d type %s in %s() is not supported yet.\n", i, btf_type_str(t), tname); return -EINVAL; } sub->arg_cnt = nargs; sub->args_cached = true; return 0; } static void btf_type_show(const struct btf *btf, u32 type_id, void *obj, struct btf_show *show) { const struct btf_type *t = btf_type_by_id(btf, type_id); show->btf = btf; memset(&show->state, 0, sizeof(show->state)); memset(&show->obj, 0, sizeof(show->obj)); btf_type_ops(t)->show(btf, t, type_id, obj, 0, show); } __printf(2, 0) static void btf_seq_show(struct btf_show *show, const char *fmt, va_list args) { seq_vprintf((struct seq_file *)show->target, fmt, args); } int btf_type_seq_show_flags(const struct btf *btf, u32 type_id, void *obj, struct seq_file *m, u64 flags) { struct btf_show sseq; sseq.target = m; sseq.showfn = btf_seq_show; sseq.flags = flags; btf_type_show(btf, type_id, obj, &sseq); return sseq.state.status; } void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, struct seq_file *m) { (void) btf_type_seq_show_flags(btf, type_id, obj, m, BTF_SHOW_NONAME | BTF_SHOW_COMPACT | BTF_SHOW_ZERO | BTF_SHOW_UNSAFE); } struct btf_show_snprintf { struct btf_show show; int len_left; /* space left in string */ int len; /* length we would have written */ }; __printf(2, 0) static void btf_snprintf_show(struct btf_show *show, const char *fmt, va_list args) { struct btf_show_snprintf *ssnprintf = (struct btf_show_snprintf *)show; int len; len = vsnprintf(show->target, ssnprintf->len_left, fmt, args); if (len < 0) { ssnprintf->len_left = 0; ssnprintf->len = len; } else if (len >= ssnprintf->len_left) { /* no space, drive on to get length we would have written */ ssnprintf->len_left = 0; ssnprintf->len += len; } else { ssnprintf->len_left -= len; ssnprintf->len += len; show->target += len; } } int btf_type_snprintf_show(const struct btf *btf, u32 type_id, void *obj, char *buf, int len, u64 flags) { struct btf_show_snprintf ssnprintf; ssnprintf.show.target = buf; ssnprintf.show.flags = flags; ssnprintf.show.showfn = btf_snprintf_show; ssnprintf.len_left = len; ssnprintf.len = 0; btf_type_show(btf, type_id, obj, (struct btf_show *)&ssnprintf); /* If we encountered an error, return it. */ if (ssnprintf.show.state.status) return ssnprintf.show.state.status; /* Otherwise return length we would have written */ return ssnprintf.len; } #ifdef CONFIG_PROC_FS static void bpf_btf_show_fdinfo(struct seq_file *m, struct file *filp) { const struct btf *btf = filp->private_data; seq_printf(m, "btf_id:\t%u\n", btf->id); } #endif static int btf_release(struct inode *inode, struct file *filp) { btf_put(filp->private_data); return 0; } const struct file_operations btf_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = bpf_btf_show_fdinfo, #endif .release = btf_release, }; static int __btf_new_fd(struct btf *btf) { return anon_inode_getfd("btf", &btf_fops, btf, O_RDONLY | O_CLOEXEC); } int btf_new_fd(const union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size) { struct btf *btf; int ret; btf = btf_parse(attr, uattr, uattr_size); if (IS_ERR(btf)) return PTR_ERR(btf); ret = btf_alloc_id(btf); if (ret) { btf_free(btf); return ret; } /* * The BTF ID is published to the userspace. * All BTF free must go through call_rcu() from * now on (i.e. free by calling btf_put()). */ ret = __btf_new_fd(btf); if (ret < 0) btf_put(btf); return ret; } struct btf *btf_get_by_fd(int fd) { struct btf *btf; CLASS(fd, f)(fd); btf = __btf_get_by_fd(f); if (!IS_ERR(btf)) refcount_inc(&btf->refcnt); return btf; } int btf_get_info_by_fd(const struct btf *btf, const union bpf_attr *attr, union bpf_attr __user *uattr) { struct bpf_btf_info __user *uinfo; struct bpf_btf_info info; u32 info_copy, btf_copy; void __user *ubtf; char __user *uname; u32 uinfo_len, uname_len, name_len; int ret = 0; uinfo = u64_to_user_ptr(attr->info.info); uinfo_len = attr->info.info_len; info_copy = min_t(u32, uinfo_len, sizeof(info)); memset(&info, 0, sizeof(info)); if (copy_from_user(&info, uinfo, info_copy)) return -EFAULT; info.id = btf->id; ubtf = u64_to_user_ptr(info.btf); btf_copy = min_t(u32, btf->data_size, info.btf_size); if (copy_to_user(ubtf, btf->data, btf_copy)) return -EFAULT; info.btf_size = btf->data_size; info.kernel_btf = btf->kernel_btf; uname = u64_to_user_ptr(info.name); uname_len = info.name_len; if (!uname ^ !uname_len) return -EINVAL; name_len = strlen(btf->name); info.name_len = name_len; if (uname) { if (uname_len >= name_len + 1) { if (copy_to_user(uname, btf->name, name_len + 1)) return -EFAULT; } else { char zero = '\0'; if (copy_to_user(uname, btf->name, uname_len - 1)) return -EFAULT; if (put_user(zero, uname + uname_len - 1)) return -EFAULT; /* let user-space know about too short buffer */ ret = -ENOSPC; } } if (copy_to_user(uinfo, &info, info_copy) || put_user(info_copy, &uattr->info.info_len)) return -EFAULT; return ret; } int btf_get_fd_by_id(u32 id) { struct btf *btf; int fd; rcu_read_lock(); btf = idr_find(&btf_idr, id); if (!btf || !refcount_inc_not_zero(&btf->refcnt)) btf = ERR_PTR(-ENOENT); rcu_read_unlock(); if (IS_ERR(btf)) return PTR_ERR(btf); fd = __btf_new_fd(btf); if (fd < 0) btf_put(btf); return fd; } u32 btf_obj_id(const struct btf *btf) { return btf->id; } bool btf_is_kernel(const struct btf *btf) { return btf->kernel_btf; } bool btf_is_module(const struct btf *btf) { return btf->kernel_btf && strcmp(btf->name, "vmlinux") != 0; } enum { BTF_MODULE_F_LIVE = (1 << 0), }; #ifdef CONFIG_DEBUG_INFO_BTF_MODULES struct btf_module { struct list_head list; struct module *module; struct btf *btf; struct bin_attribute *sysfs_attr; int flags; }; static LIST_HEAD(btf_modules); static DEFINE_MUTEX(btf_module_mutex); static void purge_cand_cache(struct btf *btf); static int btf_module_notify(struct notifier_block *nb, unsigned long op, void *module) { struct btf_module *btf_mod, *tmp; struct module *mod = module; struct btf *btf; int err = 0; if (mod->btf_data_size == 0 || (op != MODULE_STATE_COMING && op != MODULE_STATE_LIVE && op != MODULE_STATE_GOING)) goto out; switch (op) { case MODULE_STATE_COMING: btf_mod = kzalloc(sizeof(*btf_mod), GFP_KERNEL); if (!btf_mod) { err = -ENOMEM; goto out; } btf = btf_parse_module(mod->name, mod->btf_data, mod->btf_data_size, mod->btf_base_data, mod->btf_base_data_size); if (IS_ERR(btf)) { kfree(btf_mod); if (!IS_ENABLED(CONFIG_MODULE_ALLOW_BTF_MISMATCH)) { pr_warn("failed to validate module [%s] BTF: %ld\n", mod->name, PTR_ERR(btf)); err = PTR_ERR(btf); } else { pr_warn_once("Kernel module BTF mismatch detected, BTF debug info may be unavailable for some modules\n"); } goto out; } err = btf_alloc_id(btf); if (err) { btf_free(btf); kfree(btf_mod); goto out; } purge_cand_cache(NULL); mutex_lock(&btf_module_mutex); btf_mod->module = module; btf_mod->btf = btf; list_add(&btf_mod->list, &btf_modules); mutex_unlock(&btf_module_mutex); if (IS_ENABLED(CONFIG_SYSFS)) { struct bin_attribute *attr; attr = kzalloc(sizeof(*attr), GFP_KERNEL); if (!attr) goto out; sysfs_bin_attr_init(attr); attr->attr.name = btf->name; attr->attr.mode = 0444; attr->size = btf->data_size; attr->private = btf->data; attr->read = sysfs_bin_attr_simple_read; err = sysfs_create_bin_file(btf_kobj, attr); if (err) { pr_warn("failed to register module [%s] BTF in sysfs: %d\n", mod->name, err); kfree(attr); err = 0; goto out; } btf_mod->sysfs_attr = attr; } break; case MODULE_STATE_LIVE: mutex_lock(&btf_module_mutex); list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) { if (btf_mod->module != module) continue; btf_mod->flags |= BTF_MODULE_F_LIVE; break; } mutex_unlock(&btf_module_mutex); break; case MODULE_STATE_GOING: mutex_lock(&btf_module_mutex); list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) { if (btf_mod->module != module) continue; list_del(&btf_mod->list); if (btf_mod->sysfs_attr) sysfs_remove_bin_file(btf_kobj, btf_mod->sysfs_attr); purge_cand_cache(btf_mod->btf); btf_put(btf_mod->btf); kfree(btf_mod->sysfs_attr); kfree(btf_mod); break; } mutex_unlock(&btf_module_mutex); break; } out: return notifier_from_errno(err); } static struct notifier_block btf_module_nb = { .notifier_call = btf_module_notify, }; static int __init btf_module_init(void) { register_module_notifier(&btf_module_nb); return 0; } fs_initcall(btf_module_init); #endif /* CONFIG_DEBUG_INFO_BTF_MODULES */ struct module *btf_try_get_module(const struct btf *btf) { struct module *res = NULL; #ifdef CONFIG_DEBUG_INFO_BTF_MODULES struct btf_module *btf_mod, *tmp; mutex_lock(&btf_module_mutex); list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) { if (btf_mod->btf != btf) continue; /* We must only consider module whose __init routine has * finished, hence we must check for BTF_MODULE_F_LIVE flag, * which is set from the notifier callback for * MODULE_STATE_LIVE. */ if ((btf_mod->flags & BTF_MODULE_F_LIVE) && try_module_get(btf_mod->module)) res = btf_mod->module; break; } mutex_unlock(&btf_module_mutex); #endif return res; } /* Returns struct btf corresponding to the struct module. * This function can return NULL or ERR_PTR. */ static struct btf *btf_get_module_btf(const struct module *module) { #ifdef CONFIG_DEBUG_INFO_BTF_MODULES struct btf_module *btf_mod, *tmp; #endif struct btf *btf = NULL; if (!module) { btf = bpf_get_btf_vmlinux(); if (!IS_ERR_OR_NULL(btf)) btf_get(btf); return btf; } #ifdef CONFIG_DEBUG_INFO_BTF_MODULES mutex_lock(&btf_module_mutex); list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) { if (btf_mod->module != module) continue; btf_get(btf_mod->btf); btf = btf_mod->btf; break; } mutex_unlock(&btf_module_mutex); #endif return btf; } static int check_btf_kconfigs(const struct module *module, const char *feature) { if (!module && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) { pr_err("missing vmlinux BTF, cannot register %s\n", feature); return -ENOENT; } if (module && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES)) pr_warn("missing module BTF, cannot register %s\n", feature); return 0; } BPF_CALL_4(bpf_btf_find_by_name_kind, char *, name, int, name_sz, u32, kind, int, flags) { struct btf *btf = NULL; int btf_obj_fd = 0; long ret; if (flags) return -EINVAL; if (name_sz <= 1 || name[name_sz - 1]) return -EINVAL; ret = bpf_find_btf_id(name, kind, &btf); if (ret > 0 && btf_is_module(btf)) { btf_obj_fd = __btf_new_fd(btf); if (btf_obj_fd < 0) { btf_put(btf); return btf_obj_fd; } return ret | (((u64)btf_obj_fd) << 32); } if (ret > 0) btf_put(btf); return ret; } const struct bpf_func_proto bpf_btf_find_by_name_kind_proto = { .func = bpf_btf_find_by_name_kind, .gpl_only = false, .ret_type = RET_INTEGER, .arg1_type = ARG_PTR_TO_MEM | MEM_RDONLY, .arg2_type = ARG_CONST_SIZE, .arg3_type = ARG_ANYTHING, .arg4_type = ARG_ANYTHING, }; BTF_ID_LIST_GLOBAL(btf_tracing_ids, MAX_BTF_TRACING_TYPE) #define BTF_TRACING_TYPE(name, type) BTF_ID(struct, type) BTF_TRACING_TYPE_xxx #undef BTF_TRACING_TYPE /* Validate well-formedness of iter argument type. * On success, return positive BTF ID of iter state's STRUCT type. * On error, negative error is returned. */ int btf_check_iter_arg(struct btf *btf, const struct btf_type *func, int arg_idx) { const struct btf_param *arg; const struct btf_type *t; const char *name; int btf_id; if (btf_type_vlen(func) <= arg_idx) return -EINVAL; arg = &btf_params(func)[arg_idx]; t = btf_type_skip_modifiers(btf, arg->type, NULL); if (!t || !btf_type_is_ptr(t)) return -EINVAL; t = btf_type_skip_modifiers(btf, t->type, &btf_id); if (!t || !__btf_type_is_struct(t)) return -EINVAL; name = btf_name_by_offset(btf, t->name_off); if (!name || strncmp(name, ITER_PREFIX, sizeof(ITER_PREFIX) - 1)) return -EINVAL; return btf_id; } static int btf_check_iter_kfuncs(struct btf *btf, const char *func_name, const struct btf_type *func, u32 func_flags) { u32 flags = func_flags & (KF_ITER_NEW | KF_ITER_NEXT | KF_ITER_DESTROY); const char *sfx, *iter_name; const struct btf_type *t; char exp_name[128]; u32 nr_args; int btf_id; /* exactly one of KF_ITER_{NEW,NEXT,DESTROY} can be set */ if (!flags || (flags & (flags - 1))) return -EINVAL; /* any BPF iter kfunc should have `struct bpf_iter_<type> *` first arg */ nr_args = btf_type_vlen(func); if (nr_args < 1) return -EINVAL; btf_id = btf_check_iter_arg(btf, func, 0); if (btf_id < 0) return btf_id; /* sizeof(struct bpf_iter_<type>) should be a multiple of 8 to * fit nicely in stack slots */ t = btf_type_by_id(btf, btf_id); if (t->size == 0 || (t->size % 8)) return -EINVAL; /* validate bpf_iter_<type>_{new,next,destroy}(struct bpf_iter_<type> *) * naming pattern */ iter_name = btf_name_by_offset(btf, t->name_off) + sizeof(ITER_PREFIX) - 1; if (flags & KF_ITER_NEW) sfx = "new"; else if (flags & KF_ITER_NEXT) sfx = "next"; else /* (flags & KF_ITER_DESTROY) */ sfx = "destroy"; snprintf(exp_name, sizeof(exp_name), "bpf_iter_%s_%s", iter_name, sfx); if (strcmp(func_name, exp_name)) return -EINVAL; /* only iter constructor should have extra arguments */ if (!(flags & KF_ITER_NEW) && nr_args != 1) return -EINVAL; if (flags & KF_ITER_NEXT) { /* bpf_iter_<type>_next() should return pointer */ t = btf_type_skip_modifiers(btf, func->type, NULL); if (!t || !btf_type_is_ptr(t)) return -EINVAL; } if (flags & KF_ITER_DESTROY) { /* bpf_iter_<type>_destroy() should return void */ t = btf_type_by_id(btf, func->type); if (!t || !btf_type_is_void(t)) return -EINVAL; } return 0; } static int btf_check_kfunc_protos(struct btf *btf, u32 func_id, u32 func_flags) { const struct btf_type *func; const char *func_name; int err; /* any kfunc should be FUNC -> FUNC_PROTO */ func = btf_type_by_id(btf, func_id); if (!func || !btf_type_is_func(func)) return -EINVAL; /* sanity check kfunc name */ func_name = btf_name_by_offset(btf, func->name_off); if (!func_name || !func_name[0]) return -EINVAL; func = btf_type_by_id(btf, func->type); if (!func || !btf_type_is_func_proto(func)) return -EINVAL; if (func_flags & (KF_ITER_NEW | KF_ITER_NEXT | KF_ITER_DESTROY)) { err = btf_check_iter_kfuncs(btf, func_name, func, func_flags); if (err) return err; } return 0; } /* Kernel Function (kfunc) BTF ID set registration API */ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, const struct btf_kfunc_id_set *kset) { struct btf_kfunc_hook_filter *hook_filter; struct btf_id_set8 *add_set = kset->set; bool vmlinux_set = !btf_is_module(btf); bool add_filter = !!kset->filter; struct btf_kfunc_set_tab *tab; struct btf_id_set8 *set; u32 set_cnt, i; int ret; if (hook >= BTF_KFUNC_HOOK_MAX) { ret = -EINVAL; goto end; } if (!add_set->cnt) return 0; tab = btf->kfunc_set_tab; if (tab && add_filter) { u32 i; hook_filter = &tab->hook_filters[hook]; for (i = 0; i < hook_filter->nr_filters; i++) { if (hook_filter->filters[i] == kset->filter) { add_filter = false; break; } } if (add_filter && hook_filter->nr_filters == BTF_KFUNC_FILTER_MAX_CNT) { ret = -E2BIG; goto end; } } if (!tab) { tab = kzalloc(sizeof(*tab), GFP_KERNEL | __GFP_NOWARN); if (!tab) return -ENOMEM; btf->kfunc_set_tab = tab; } set = tab->sets[hook]; /* Warn when register_btf_kfunc_id_set is called twice for the same hook * for module sets. */ if (WARN_ON_ONCE(set && !vmlinux_set)) { ret = -EINVAL; goto end; } /* In case of vmlinux sets, there may be more than one set being * registered per hook. To create a unified set, we allocate a new set * and concatenate all individual sets being registered. While each set * is individually sorted, they may become unsorted when concatenated, * hence re-sorting the final set again is required to make binary * searching the set using btf_id_set8_contains function work. * * For module sets, we need to allocate as we may need to relocate * BTF ids. */ set_cnt = set ? set->cnt : 0; if (set_cnt > U32_MAX - add_set->cnt) { ret = -EOVERFLOW; goto end; } if (set_cnt + add_set->cnt > BTF_KFUNC_SET_MAX_CNT) { ret = -E2BIG; goto end; } /* Grow set */ set = krealloc(tab->sets[hook], struct_size(set, pairs, set_cnt + add_set->cnt), GFP_KERNEL | __GFP_NOWARN); if (!set) { ret = -ENOMEM; goto end; } /* For newly allocated set, initialize set->cnt to 0 */ if (!tab->sets[hook]) set->cnt = 0; tab->sets[hook] = set; /* Concatenate the two sets */ memcpy(set->pairs + set->cnt, add_set->pairs, add_set->cnt * sizeof(set->pairs[0])); /* Now that the set is copied, update with relocated BTF ids */ for (i = set->cnt; i < set->cnt + add_set->cnt; i++) set->pairs[i].id = btf_relocate_id(btf, set->pairs[i].id); set->cnt += add_set->cnt; sort(set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func, NULL); if (add_filter) { hook_filter = &tab->hook_filters[hook]; hook_filter->filters[hook_filter->nr_filters++] = kset->filter; } return 0; end: btf_free_kfunc_set_tab(btf); return ret; } static u32 *__btf_kfunc_id_set_contains(const struct btf *btf, enum btf_kfunc_hook hook, u32 kfunc_btf_id, const struct bpf_prog *prog) { struct btf_kfunc_hook_filter *hook_filter; struct btf_id_set8 *set; u32 *id, i; if (hook >= BTF_KFUNC_HOOK_MAX) return NULL; if (!btf->kfunc_set_tab) return NULL; hook_filter = &btf->kfunc_set_tab->hook_filters[hook]; for (i = 0; i < hook_filter->nr_filters; i++) { if (hook_filter->filters[i](prog, kfunc_btf_id)) return NULL; } set = btf->kfunc_set_tab->sets[hook]; if (!set) return NULL; id = btf_id_set8_contains(set, kfunc_btf_id); if (!id) return NULL; /* The flags for BTF ID are located next to it */ return id + 1; } static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type) { switch (prog_type) { case BPF_PROG_TYPE_UNSPEC: return BTF_KFUNC_HOOK_COMMON; case BPF_PROG_TYPE_XDP: return BTF_KFUNC_HOOK_XDP; case BPF_PROG_TYPE_SCHED_CLS: return BTF_KFUNC_HOOK_TC; case BPF_PROG_TYPE_STRUCT_OPS: return BTF_KFUNC_HOOK_STRUCT_OPS; case BPF_PROG_TYPE_TRACING: case BPF_PROG_TYPE_TRACEPOINT: case BPF_PROG_TYPE_PERF_EVENT: case BPF_PROG_TYPE_LSM: return BTF_KFUNC_HOOK_TRACING; case BPF_PROG_TYPE_SYSCALL: return BTF_KFUNC_HOOK_SYSCALL; case BPF_PROG_TYPE_CGROUP_SKB: case BPF_PROG_TYPE_CGROUP_SOCK: case BPF_PROG_TYPE_CGROUP_DEVICE: case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: case BPF_PROG_TYPE_CGROUP_SOCKOPT: case BPF_PROG_TYPE_CGROUP_SYSCTL: case BPF_PROG_TYPE_SOCK_OPS: return BTF_KFUNC_HOOK_CGROUP; case BPF_PROG_TYPE_SCHED_ACT: return BTF_KFUNC_HOOK_SCHED_ACT; case BPF_PROG_TYPE_SK_SKB: return BTF_KFUNC_HOOK_SK_SKB; case BPF_PROG_TYPE_SOCKET_FILTER: return BTF_KFUNC_HOOK_SOCKET_FILTER; case BPF_PROG_TYPE_LWT_OUT: case BPF_PROG_TYPE_LWT_IN: case BPF_PROG_TYPE_LWT_XMIT: case BPF_PROG_TYPE_LWT_SEG6LOCAL: return BTF_KFUNC_HOOK_LWT; case BPF_PROG_TYPE_NETFILTER: return BTF_KFUNC_HOOK_NETFILTER; case BPF_PROG_TYPE_KPROBE: return BTF_KFUNC_HOOK_KPROBE; default: return BTF_KFUNC_HOOK_MAX; } } /* Caution: * Reference to the module (obtained using btf_try_get_module) corresponding to * the struct btf *MUST* be held when calling this function from verifier * context. This is usually true as we stash references in prog's kfunc_btf_tab; * keeping the reference for the duration of the call provides the necessary * protection for looking up a well-formed btf->kfunc_set_tab. */ u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id, const struct bpf_prog *prog) { enum bpf_prog_type prog_type = resolve_prog_type(prog); enum btf_kfunc_hook hook; u32 *kfunc_flags; kfunc_flags = __btf_kfunc_id_set_contains(btf, BTF_KFUNC_HOOK_COMMON, kfunc_btf_id, prog); if (kfunc_flags) return kfunc_flags; hook = bpf_prog_type_to_kfunc_hook(prog_type); return __btf_kfunc_id_set_contains(btf, hook, kfunc_btf_id, prog); } u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id, const struct bpf_prog *prog) { return __btf_kfunc_id_set_contains(btf, BTF_KFUNC_HOOK_FMODRET, kfunc_btf_id, prog); } static int __register_btf_kfunc_id_set(enum btf_kfunc_hook hook, const struct btf_kfunc_id_set *kset) { struct btf *btf; int ret, i; btf = btf_get_module_btf(kset->owner); if (!btf) return check_btf_kconfigs(kset->owner, "kfunc"); if (IS_ERR(btf)) return PTR_ERR(btf); for (i = 0; i < kset->set->cnt; i++) { ret = btf_check_kfunc_protos(btf, btf_relocate_id(btf, kset->set->pairs[i].id), kset->set->pairs[i].flags); if (ret) goto err_out; } ret = btf_populate_kfunc_set(btf, hook, kset); err_out: btf_put(btf); return ret; } /* This function must be invoked only from initcalls/module init functions */ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type, const struct btf_kfunc_id_set *kset) { enum btf_kfunc_hook hook; /* All kfuncs need to be tagged as such in BTF. * WARN() for initcall registrations that do not check errors. */ if (!(kset->set->flags & BTF_SET8_KFUNCS)) { WARN_ON(!kset->owner); return -EINVAL; } hook = bpf_prog_type_to_kfunc_hook(prog_type); return __register_btf_kfunc_id_set(hook, kset); } EXPORT_SYMBOL_GPL(register_btf_kfunc_id_set); /* This function must be invoked only from initcalls/module init functions */ int register_btf_fmodret_id_set(const struct btf_kfunc_id_set *kset) { return __register_btf_kfunc_id_set(BTF_KFUNC_HOOK_FMODRET, kset); } EXPORT_SYMBOL_GPL(register_btf_fmodret_id_set); s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id) { struct btf_id_dtor_kfunc_tab *tab = btf->dtor_kfunc_tab; struct btf_id_dtor_kfunc *dtor; if (!tab) return -ENOENT; /* Even though the size of tab->dtors[0] is > sizeof(u32), we only need * to compare the first u32 with btf_id, so we can reuse btf_id_cmp_func. */ BUILD_BUG_ON(offsetof(struct btf_id_dtor_kfunc, btf_id) != 0); dtor = bsearch(&btf_id, tab->dtors, tab->cnt, sizeof(tab->dtors[0]), btf_id_cmp_func); if (!dtor) return -ENOENT; return dtor->kfunc_btf_id; } static int btf_check_dtor_kfuncs(struct btf *btf, const struct btf_id_dtor_kfunc *dtors, u32 cnt) { const struct btf_type *dtor_func, *dtor_func_proto, *t; const struct btf_param *args; s32 dtor_btf_id; u32 nr_args, i; for (i = 0; i < cnt; i++) { dtor_btf_id = btf_relocate_id(btf, dtors[i].kfunc_btf_id); dtor_func = btf_type_by_id(btf, dtor_btf_id); if (!dtor_func || !btf_type_is_func(dtor_func)) return -EINVAL; dtor_func_proto = btf_type_by_id(btf, dtor_func->type); if (!dtor_func_proto || !btf_type_is_func_proto(dtor_func_proto)) return -EINVAL; /* Make sure the prototype of the destructor kfunc is 'void func(type *)' */ t = btf_type_by_id(btf, dtor_func_proto->type); if (!t || !btf_type_is_void(t)) return -EINVAL; nr_args = btf_type_vlen(dtor_func_proto); if (nr_args != 1) return -EINVAL; args = btf_params(dtor_func_proto); t = btf_type_by_id(btf, args[0].type); /* Allow any pointer type, as width on targets Linux supports * will be same for all pointer types (i.e. sizeof(void *)) */ if (!t || !btf_type_is_ptr(t)) return -EINVAL; } return 0; } /* This function must be invoked only from initcalls/module init functions */ int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt, struct module *owner) { struct btf_id_dtor_kfunc_tab *tab; struct btf *btf; u32 tab_cnt, i; int ret; btf = btf_get_module_btf(owner); if (!btf) return check_btf_kconfigs(owner, "dtor kfuncs"); if (IS_ERR(btf)) return PTR_ERR(btf); if (add_cnt >= BTF_DTOR_KFUNC_MAX_CNT) { pr_err("cannot register more than %d kfunc destructors\n", BTF_DTOR_KFUNC_MAX_CNT); ret = -E2BIG; goto end; } /* Ensure that the prototype of dtor kfuncs being registered is sane */ ret = btf_check_dtor_kfuncs(btf, dtors, add_cnt); if (ret < 0) goto end; tab = btf->dtor_kfunc_tab; /* Only one call allowed for modules */ if (WARN_ON_ONCE(tab && btf_is_module(btf))) { ret = -EINVAL; goto end; } tab_cnt = tab ? tab->cnt : 0; if (tab_cnt > U32_MAX - add_cnt) { ret = -EOVERFLOW; goto end; } if (tab_cnt + add_cnt >= BTF_DTOR_KFUNC_MAX_CNT) { pr_err("cannot register more than %d kfunc destructors\n", BTF_DTOR_KFUNC_MAX_CNT); ret = -E2BIG; goto end; } tab = krealloc(btf->dtor_kfunc_tab, struct_size(tab, dtors, tab_cnt + add_cnt), GFP_KERNEL | __GFP_NOWARN); if (!tab) { ret = -ENOMEM; goto end; } if (!btf->dtor_kfunc_tab) tab->cnt = 0; btf->dtor_kfunc_tab = tab; memcpy(tab->dtors + tab->cnt, dtors, add_cnt * sizeof(tab->dtors[0])); /* remap BTF ids based on BTF relocation (if any) */ for (i = tab_cnt; i < tab_cnt + add_cnt; i++) { tab->dtors[i].btf_id = btf_relocate_id(btf, tab->dtors[i].btf_id); tab->dtors[i].kfunc_btf_id = btf_relocate_id(btf, tab->dtors[i].kfunc_btf_id); } tab->cnt += add_cnt; sort(tab->dtors, tab->cnt, sizeof(tab->dtors[0]), btf_id_cmp_func, NULL); end: if (ret) btf_free_dtor_kfunc_tab(btf); btf_put(btf); return ret; } EXPORT_SYMBOL_GPL(register_btf_id_dtor_kfuncs); #define MAX_TYPES_ARE_COMPAT_DEPTH 2 /* Check local and target types for compatibility. This check is used for * type-based CO-RE relocations and follow slightly different rules than * field-based relocations. This function assumes that root types were already * checked for name match. Beyond that initial root-level name check, names * are completely ignored. Compatibility rules are as follows: * - any two STRUCTs/UNIONs/FWDs/ENUMs/INTs/ENUM64s are considered compatible, but * kind should match for local and target types (i.e., STRUCT is not * compatible with UNION); * - for ENUMs/ENUM64s, the size is ignored; * - for INT, size and signedness are ignored; * - for ARRAY, dimensionality is ignored, element types are checked for * compatibility recursively; * - CONST/VOLATILE/RESTRICT modifiers are ignored; * - TYPEDEFs/PTRs are compatible if types they pointing to are compatible; * - FUNC_PROTOs are compatible if they have compatible signature: same * number of input args and compatible return and argument types. * These rules are not set in stone and probably will be adjusted as we get * more experience with using BPF CO-RE relocations. */ int bpf_core_types_are_compat(const struct btf *local_btf, __u32 local_id, const struct btf *targ_btf, __u32 targ_id) { return __bpf_core_types_are_compat(local_btf, local_id, targ_btf, targ_id, MAX_TYPES_ARE_COMPAT_DEPTH); } #define MAX_TYPES_MATCH_DEPTH 2 int bpf_core_types_match(const struct btf *local_btf, u32 local_id, const struct btf *targ_btf, u32 targ_id) { return __bpf_core_types_match(local_btf, local_id, targ_btf, targ_id, false, MAX_TYPES_MATCH_DEPTH); } static bool bpf_core_is_flavor_sep(const char *s) { /* check X___Y name pattern, where X and Y are not underscores */ return s[0] != '_' && /* X */ s[1] == '_' && s[2] == '_' && s[3] == '_' && /* ___ */ s[4] != '_'; /* Y */ } size_t bpf_core_essential_name_len(const char *name) { size_t n = strlen(name); int i; for (i = n - 5; i >= 0; i--) { if (bpf_core_is_flavor_sep(name + i)) return i + 1; } return n; } static void bpf_free_cands(struct bpf_cand_cache *cands) { if (!cands->cnt) /* empty candidate array was allocated on stack */ return; kfree(cands); } static void bpf_free_cands_from_cache(struct bpf_cand_cache *cands) { kfree(cands->name); kfree(cands); } #define VMLINUX_CAND_CACHE_SIZE 31 static struct bpf_cand_cache *vmlinux_cand_cache[VMLINUX_CAND_CACHE_SIZE]; #define MODULE_CAND_CACHE_SIZE 31 static struct bpf_cand_cache *module_cand_cache[MODULE_CAND_CACHE_SIZE]; static void __print_cand_cache(struct bpf_verifier_log *log, struct bpf_cand_cache **cache, int cache_size) { struct bpf_cand_cache *cc; int i, j; for (i = 0; i < cache_size; i++) { cc = cache[i]; if (!cc) continue; bpf_log(log, "[%d]%s(", i, cc->name); for (j = 0; j < cc->cnt; j++) { bpf_log(log, "%d", cc->cands[j].id); if (j < cc->cnt - 1) bpf_log(log, " "); } bpf_log(log, "), "); } } static void print_cand_cache(struct bpf_verifier_log *log) { mutex_lock(&cand_cache_mutex); bpf_log(log, "vmlinux_cand_cache:"); __print_cand_cache(log, vmlinux_cand_cache, VMLINUX_CAND_CACHE_SIZE); bpf_log(log, "\nmodule_cand_cache:"); __print_cand_cache(log, module_cand_cache, MODULE_CAND_CACHE_SIZE); bpf_log(log, "\n"); mutex_unlock(&cand_cache_mutex); } static u32 hash_cands(struct bpf_cand_cache *cands) { return jhash(cands->name, cands->name_len, 0); } static struct bpf_cand_cache *check_cand_cache(struct bpf_cand_cache *cands, struct bpf_cand_cache **cache, int cache_size) { struct bpf_cand_cache *cc = cache[hash_cands(cands) % cache_size]; if (cc && cc->name_len == cands->name_len && !strncmp(cc->name, cands->name, cands->name_len)) return cc; return NULL; } static size_t sizeof_cands(int cnt) { return offsetof(struct bpf_cand_cache, cands[cnt]); } static struct bpf_cand_cache *populate_cand_cache(struct bpf_cand_cache *cands, struct bpf_cand_cache **cache, int cache_size) { struct bpf_cand_cache **cc = &cache[hash_cands(cands) % cache_size], *new_cands; if (*cc) { bpf_free_cands_from_cache(*cc); *cc = NULL; } new_cands = kmemdup(cands, sizeof_cands(cands->cnt), GFP_KERNEL_ACCOUNT); if (!new_cands) { bpf_free_cands(cands); return ERR_PTR(-ENOMEM); } /* strdup the name, since it will stay in cache. * the cands->name points to strings in prog's BTF and the prog can be unloaded. */ new_cands->name = kmemdup_nul(cands->name, cands->name_len, GFP_KERNEL_ACCOUNT); bpf_free_cands(cands); if (!new_cands->name) { kfree(new_cands); return ERR_PTR(-ENOMEM); } *cc = new_cands; return new_cands; } #ifdef CONFIG_DEBUG_INFO_BTF_MODULES static void __purge_cand_cache(struct btf *btf, struct bpf_cand_cache **cache, int cache_size) { struct bpf_cand_cache *cc; int i, j; for (i = 0; i < cache_size; i++) { cc = cache[i]; if (!cc) continue; if (!btf) { /* when new module is loaded purge all of module_cand_cache, * since new module might have candidates with the name * that matches cached cands. */ bpf_free_cands_from_cache(cc); cache[i] = NULL; continue; } /* when module is unloaded purge cache entries * that match module's btf */ for (j = 0; j < cc->cnt; j++) if (cc->cands[j].btf == btf) { bpf_free_cands_from_cache(cc); cache[i] = NULL; break; } } } static void purge_cand_cache(struct btf *btf) { mutex_lock(&cand_cache_mutex); __purge_cand_cache(btf, module_cand_cache, MODULE_CAND_CACHE_SIZE); mutex_unlock(&cand_cache_mutex); } #endif static struct bpf_cand_cache * bpf_core_add_cands(struct bpf_cand_cache *cands, const struct btf *targ_btf, int targ_start_id) { struct bpf_cand_cache *new_cands; const struct btf_type *t; const char *targ_name; size_t targ_essent_len; int n, i; n = btf_nr_types(targ_btf); for (i = targ_start_id; i < n; i++) { t = btf_type_by_id(targ_btf, i); if (btf_kind(t) != cands->kind) continue; targ_name = btf_name_by_offset(targ_btf, t->name_off); if (!targ_name) continue; /* the resched point is before strncmp to make sure that search * for non-existing name will have a chance to schedule(). */ cond_resched(); if (strncmp(cands->name, targ_name, cands->name_len) != 0) continue; targ_essent_len = bpf_core_essential_name_len(targ_name); if (targ_essent_len != cands->name_len) continue; /* most of the time there is only one candidate for a given kind+name pair */ new_cands = kmalloc(sizeof_cands(cands->cnt + 1), GFP_KERNEL_ACCOUNT); if (!new_cands) { bpf_free_cands(cands); return ERR_PTR(-ENOMEM); } memcpy(new_cands, cands, sizeof_cands(cands->cnt)); bpf_free_cands(cands); cands = new_cands; cands->cands[cands->cnt].btf = targ_btf; cands->cands[cands->cnt].id = i; cands->cnt++; } return cands; } static struct bpf_cand_cache * bpf_core_find_cands(struct bpf_core_ctx *ctx, u32 local_type_id) { struct bpf_cand_cache *cands, *cc, local_cand = {}; const struct btf *local_btf = ctx->btf; const struct btf_type *local_type; const struct btf *main_btf; size_t local_essent_len; struct btf *mod_btf; const char *name; int id; main_btf = bpf_get_btf_vmlinux(); if (IS_ERR(main_btf)) return ERR_CAST(main_btf); if (!main_btf) return ERR_PTR(-EINVAL); local_type = btf_type_by_id(local_btf, local_type_id); if (!local_type) return ERR_PTR(-EINVAL); name = btf_name_by_offset(local_btf, local_type->name_off); if (str_is_empty(name)) return ERR_PTR(-EINVAL); local_essent_len = bpf_core_essential_name_len(name); cands = &local_cand; cands->name = name; cands->kind = btf_kind(local_type); cands->name_len = local_essent_len; cc = check_cand_cache(cands, vmlinux_cand_cache, VMLINUX_CAND_CACHE_SIZE); /* cands is a pointer to stack here */ if (cc) { if (cc->cnt) return cc; goto check_modules; } /* Attempt to find target candidates in vmlinux BTF first */ cands = bpf_core_add_cands(cands, main_btf, 1); if (IS_ERR(cands)) return ERR_CAST(cands); /* cands is a pointer to kmalloced memory here if cands->cnt > 0 */ /* populate cache even when cands->cnt == 0 */ cc = populate_cand_cache(cands, vmlinux_cand_cache, VMLINUX_CAND_CACHE_SIZE); if (IS_ERR(cc)) return ERR_CAST(cc); /* if vmlinux BTF has any candidate, don't go for module BTFs */ if (cc->cnt) return cc; check_modules: /* cands is a pointer to stack here and cands->cnt == 0 */ cc = check_cand_cache(cands, module_cand_cache, MODULE_CAND_CACHE_SIZE); if (cc) /* if cache has it return it even if cc->cnt == 0 */ return cc; /* If candidate is not found in vmlinux's BTF then search in module's BTFs */ spin_lock_bh(&btf_idr_lock); idr_for_each_entry(&btf_idr, mod_btf, id) { if (!btf_is_module(mod_btf)) continue; /* linear search could be slow hence unlock/lock * the IDR to avoiding holding it for too long */ btf_get(mod_btf); spin_unlock_bh(&btf_idr_lock); cands = bpf_core_add_cands(cands, mod_btf, btf_nr_types(main_btf)); btf_put(mod_btf); if (IS_ERR(cands)) return ERR_CAST(cands); spin_lock_bh(&btf_idr_lock); } spin_unlock_bh(&btf_idr_lock); /* cands is a pointer to kmalloced memory here if cands->cnt > 0 * or pointer to stack if cands->cnd == 0. * Copy it into the cache even when cands->cnt == 0 and * return the result. */ return populate_cand_cache(cands, module_cand_cache, MODULE_CAND_CACHE_SIZE); } int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo, int relo_idx, void *insn) { bool need_cands = relo->kind != BPF_CORE_TYPE_ID_LOCAL; struct bpf_core_cand_list cands = {}; struct bpf_core_relo_res targ_res; struct bpf_core_spec *specs; const struct btf_type *type; int err; /* ~4k of temp memory necessary to convert LLVM spec like "0:1:0:5" * into arrays of btf_ids of struct fields and array indices. */ specs = kcalloc(3, sizeof(*specs), GFP_KERNEL_ACCOUNT); if (!specs) return -ENOMEM; type = btf_type_by_id(ctx->btf, relo->type_id); if (!type) { bpf_log(ctx->log, "relo #%u: bad type id %u\n", relo_idx, relo->type_id); kfree(specs); return -EINVAL; } if (need_cands) { struct bpf_cand_cache *cc; int i; mutex_lock(&cand_cache_mutex); cc = bpf_core_find_cands(ctx, relo->type_id); if (IS_ERR(cc)) { bpf_log(ctx->log, "target candidate search failed for %d\n", relo->type_id); err = PTR_ERR(cc); goto out; } if (cc->cnt) { cands.cands = kcalloc(cc->cnt, sizeof(*cands.cands), GFP_KERNEL_ACCOUNT); if (!cands.cands) { err = -ENOMEM; goto out; } } for (i = 0; i < cc->cnt; i++) { bpf_log(ctx->log, "CO-RE relocating %s %s: found target candidate [%d]\n", btf_kind_str[cc->kind], cc->name, cc->cands[i].id); cands.cands[i].btf = cc->cands[i].btf; cands.cands[i].id = cc->cands[i].id; } cands.len = cc->cnt; /* cand_cache_mutex needs to span the cache lookup and * copy of btf pointer into bpf_core_cand_list, * since module can be unloaded while bpf_core_calc_relo_insn * is working with module's btf. */ } err = bpf_core_calc_relo_insn((void *)ctx->log, relo, relo_idx, ctx->btf, &cands, specs, &targ_res); if (err) goto out; err = bpf_core_patch_insn((void *)ctx->log, insn, relo->insn_off / 8, relo, relo_idx, &targ_res); out: kfree(specs); if (need_cands) { kfree(cands.cands); mutex_unlock(&cand_cache_mutex); if (ctx->log->level & BPF_LOG_LEVEL2) print_cand_cache(ctx->log); } return err; } bool btf_nested_type_is_trusted(struct bpf_verifier_log *log, const struct bpf_reg_state *reg, const char *field_name, u32 btf_id, const char *suffix) { struct btf *btf = reg->btf; const struct btf_type *walk_type, *safe_type; const char *tname; char safe_tname[64]; long ret, safe_id; const struct btf_member *member; u32 i; walk_type = btf_type_by_id(btf, reg->btf_id); if (!walk_type) return false; tname = btf_name_by_offset(btf, walk_type->name_off); ret = snprintf(safe_tname, sizeof(safe_tname), "%s%s", tname, suffix); if (ret >= sizeof(safe_tname)) return false; safe_id = btf_find_by_name_kind(btf, safe_tname, BTF_INFO_KIND(walk_type->info)); if (safe_id < 0) return false; safe_type = btf_type_by_id(btf, safe_id); if (!safe_type) return false; for_each_member(i, safe_type, member) { const char *m_name = __btf_name_by_offset(btf, member->name_off); const struct btf_type *mtype = btf_type_by_id(btf, member->type); u32 id; if (!btf_type_is_ptr(mtype)) continue; btf_type_skip_modifiers(btf, mtype->type, &id); /* If we match on both type and name, the field is considered trusted. */ if (btf_id == id && !strcmp(field_name, m_name)) return true; } return false; } bool btf_type_ids_nocast_alias(struct bpf_verifier_log *log, const struct btf *reg_btf, u32 reg_id, const struct btf *arg_btf, u32 arg_id) { const char *reg_name, *arg_name, *search_needle; const struct btf_type *reg_type, *arg_type; int reg_len, arg_len, cmp_len; size_t pattern_len = sizeof(NOCAST_ALIAS_SUFFIX) - sizeof(char); reg_type = btf_type_by_id(reg_btf, reg_id); if (!reg_type) return false; arg_type = btf_type_by_id(arg_btf, arg_id); if (!arg_type) return false; reg_name = btf_name_by_offset(reg_btf, reg_type->name_off); arg_name = btf_name_by_offset(arg_btf, arg_type->name_off); reg_len = strlen(reg_name); arg_len = strlen(arg_name); /* Exactly one of the two type names may be suffixed with ___init, so * if the strings are the same size, they can't possibly be no-cast * aliases of one another. If you have two of the same type names, e.g. * they're both nf_conn___init, it would be improper to return true * because they are _not_ no-cast aliases, they are the same type. */ if (reg_len == arg_len) return false; /* Either of the two names must be the other name, suffixed with ___init. */ if ((reg_len != arg_len + pattern_len) && (arg_len != reg_len + pattern_len)) return false; if (reg_len < arg_len) { search_needle = strstr(arg_name, NOCAST_ALIAS_SUFFIX); cmp_len = reg_len; } else { search_needle = strstr(reg_name, NOCAST_ALIAS_SUFFIX); cmp_len = arg_len; } if (!search_needle) return false; /* ___init suffix must come at the end of the name */ if (*(search_needle + pattern_len) != '\0') return false; return !strncmp(reg_name, arg_name, cmp_len); } #ifdef CONFIG_BPF_JIT static int btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops, struct bpf_verifier_log *log) { struct btf_struct_ops_tab *tab, *new_tab; int i, err; tab = btf->struct_ops_tab; if (!tab) { tab = kzalloc(struct_size(tab, ops, 4), GFP_KERNEL); if (!tab) return -ENOMEM; tab->capacity = 4; btf->struct_ops_tab = tab; } for (i = 0; i < tab->cnt; i++) if (tab->ops[i].st_ops == st_ops) return -EEXIST; if (tab->cnt == tab->capacity) { new_tab = krealloc(tab, struct_size(tab, ops, tab->capacity * 2), GFP_KERNEL); if (!new_tab) return -ENOMEM; tab = new_tab; tab->capacity *= 2; btf->struct_ops_tab = tab; } tab->ops[btf->struct_ops_tab->cnt].st_ops = st_ops; err = bpf_struct_ops_desc_init(&tab->ops[btf->struct_ops_tab->cnt], btf, log); if (err) return err; btf->struct_ops_tab->cnt++; return 0; } const struct bpf_struct_ops_desc * bpf_struct_ops_find_value(struct btf *btf, u32 value_id) { const struct bpf_struct_ops_desc *st_ops_list; unsigned int i; u32 cnt; if (!value_id) return NULL; if (!btf->struct_ops_tab) return NULL; cnt = btf->struct_ops_tab->cnt; st_ops_list = btf->struct_ops_tab->ops; for (i = 0; i < cnt; i++) { if (st_ops_list[i].value_id == value_id) return &st_ops_list[i]; } return NULL; } const struct bpf_struct_ops_desc * bpf_struct_ops_find(struct btf *btf, u32 type_id) { const struct bpf_struct_ops_desc *st_ops_list; unsigned int i; u32 cnt; if (!type_id) return NULL; if (!btf->struct_ops_tab) return NULL; cnt = btf->struct_ops_tab->cnt; st_ops_list = btf->struct_ops_tab->ops; for (i = 0; i < cnt; i++) { if (st_ops_list[i].type_id == type_id) return &st_ops_list[i]; } return NULL; } int __register_bpf_struct_ops(struct bpf_struct_ops *st_ops) { struct bpf_verifier_log *log; struct btf *btf; int err = 0; btf = btf_get_module_btf(st_ops->owner); if (!btf) return check_btf_kconfigs(st_ops->owner, "struct_ops"); if (IS_ERR(btf)) return PTR_ERR(btf); log = kzalloc(sizeof(*log), GFP_KERNEL | __GFP_NOWARN); if (!log) { err = -ENOMEM; goto errout; } log->level = BPF_LOG_KERNEL; err = btf_add_struct_ops(btf, st_ops, log); errout: kfree(log); btf_put(btf); return err; } EXPORT_SYMBOL_GPL(__register_bpf_struct_ops); #endif bool btf_param_match_suffix(const struct btf *btf, const struct btf_param *arg, const char *suffix) { int suffix_len = strlen(suffix), len; const char *param_name; /* In the future, this can be ported to use BTF tagging */ param_name = btf_name_by_offset(btf, arg->name_off); if (str_is_empty(param_name)) return false; len = strlen(param_name); if (len <= suffix_len) return false; param_name += len - suffix_len; return !strncmp(param_name, suffix, suffix_len); } |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | /* SPDX-License-Identifier: GPL-2.0 */ /* Copyright (C) B.A.T.M.A.N. contributors: * * Marek Lindner */ #ifndef _NET_BATMAN_ADV_GATEWAY_CLIENT_H_ #define _NET_BATMAN_ADV_GATEWAY_CLIENT_H_ #include "main.h" #include <linux/kref.h> #include <linux/netlink.h> #include <linux/skbuff.h> #include <linux/types.h> #include <uapi/linux/batadv_packet.h> void batadv_gw_check_client_stop(struct batadv_priv *bat_priv); void batadv_gw_reselect(struct batadv_priv *bat_priv); void batadv_gw_election(struct batadv_priv *bat_priv); struct batadv_orig_node * batadv_gw_get_selected_orig(struct batadv_priv *bat_priv); void batadv_gw_check_election(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node); void batadv_gw_node_update(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node, struct batadv_tvlv_gateway_data *gateway); void batadv_gw_node_delete(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node); void batadv_gw_node_free(struct batadv_priv *bat_priv); void batadv_gw_node_release(struct kref *ref); struct batadv_gw_node * batadv_gw_get_selected_gw_node(struct batadv_priv *bat_priv); int batadv_gw_dump(struct sk_buff *msg, struct netlink_callback *cb); bool batadv_gw_out_of_range(struct batadv_priv *bat_priv, struct sk_buff *skb); enum batadv_dhcp_recipient batadv_gw_dhcp_recipient_get(struct sk_buff *skb, unsigned int *header_len, u8 *chaddr); struct batadv_gw_node *batadv_gw_node_get(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node); /** * batadv_gw_node_put() - decrement the gw_node refcounter and possibly release * it * @gw_node: gateway node to free */ static inline void batadv_gw_node_put(struct batadv_gw_node *gw_node) { if (!gw_node) return; kref_put(&gw_node->refcount, batadv_gw_node_release); } #endif /* _NET_BATMAN_ADV_GATEWAY_CLIENT_H_ */ |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __LINUX_PIM_H #define __LINUX_PIM_H #include <linux/skbuff.h> #include <asm/byteorder.h> /* Message types - V1 */ #define PIM_V1_VERSION cpu_to_be32(0x10000000) #define PIM_V1_REGISTER 1 /* Message types - V2 */ #define PIM_VERSION 2 /* RFC7761, sec 4.9: * Type * Types for specific PIM messages. PIM Types are: * * Message Type Destination * --------------------------------------------------------------------- * 0 = Hello Multicast to ALL-PIM-ROUTERS * 1 = Register Unicast to RP * 2 = Register-Stop Unicast to source of Register * packet * 3 = Join/Prune Multicast to ALL-PIM-ROUTERS * 4 = Bootstrap Multicast to ALL-PIM-ROUTERS * 5 = Assert Multicast to ALL-PIM-ROUTERS * 6 = Graft (used in PIM-DM only) Unicast to RPF'(S) * 7 = Graft-Ack (used in PIM-DM only) Unicast to source of Graft * packet * 8 = Candidate-RP-Advertisement Unicast to Domain's BSR */ enum { PIM_TYPE_HELLO, PIM_TYPE_REGISTER, PIM_TYPE_REGISTER_STOP, PIM_TYPE_JOIN_PRUNE, PIM_TYPE_BOOTSTRAP, PIM_TYPE_ASSERT, PIM_TYPE_GRAFT, PIM_TYPE_GRAFT_ACK, PIM_TYPE_CANDIDATE_RP_ADV }; #define PIM_NULL_REGISTER cpu_to_be32(0x40000000) /* RFC7761, sec 4.9: * The PIM header common to all PIM messages is: * 0 1 2 3 * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ * |PIM Ver| Type | Reserved | Checksum | * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ */ struct pimhdr { __u8 type; __u8 reserved; __be16 csum; }; /* PIMv2 register message header layout (ietf-draft-idmr-pimvsm-v2-00.ps */ struct pimreghdr { __u8 type; __u8 reserved; __be16 csum; __be32 flags; }; int pim_rcv_v1(struct sk_buff *skb); static inline bool ipmr_pimsm_enabled(void) { return IS_BUILTIN(CONFIG_IP_PIMSM_V1) || IS_BUILTIN(CONFIG_IP_PIMSM_V2); } static inline struct pimhdr *pim_hdr(const struct sk_buff *skb) { return (struct pimhdr *)skb_transport_header(skb); } static inline u8 pim_hdr_version(const struct pimhdr *pimhdr) { return pimhdr->type >> 4; } static inline u8 pim_hdr_type(const struct pimhdr *pimhdr) { return pimhdr->type & 0xf; } /* check if the address is 224.0.0.13, RFC7761 sec 4.3.1 */ static inline bool pim_ipv4_all_pim_routers(__be32 addr) { return addr == htonl(0xE000000D); } #endif |
| 848 848 131 15 8 8 8 8 6 3 3 8 8 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 | // SPDX-License-Identifier: GPL-2.0-or-later /* * X.25 Packet Layer release 002 * * This is ALPHA test software. This code may break your machine, * randomly fail to work with new releases, misbehave and/or generally * screw up. It might even work. * * This code REQUIRES 2.1.15 or higher * * History * X.25 001 Jonathan Naylor Started coding. * X.25 002 Jonathan Naylor Centralised disconnect handling. * New timer architecture. * 2000-03-11 Henner Eisen MSG_EOR handling more POSIX compliant. * 2000-03-22 Daniela Squassoni Allowed disabling/enabling of * facilities negotiation and increased * the throughput upper limit. * 2000-08-27 Arnaldo C. Melo s/suser/capable/ + micro cleanups * 2000-09-04 Henner Eisen Set sock->state in x25_accept(). * Fixed x25_output() related skb leakage. * 2000-10-02 Henner Eisen Made x25_kick() single threaded per socket. * 2000-10-27 Henner Eisen MSG_DONTWAIT for fragment allocation. * 2000-11-14 Henner Eisen Closing datalink from NETDEV_GOING_DOWN * 2002-10-06 Arnaldo C. Melo Get rid of cli/sti, move proc stuff to * x25_proc.c, using seq_file * 2005-04-02 Shaun Pereira Selective sub address matching * with call user data * 2005-04-15 Shaun Pereira Fast select with no restriction on * response */ #define pr_fmt(fmt) "X25: " fmt #include <linux/module.h> #include <linux/capability.h> #include <linux/errno.h> #include <linux/kernel.h> #include <linux/sched/signal.h> #include <linux/timer.h> #include <linux/string.h> #include <linux/net.h> #include <linux/netdevice.h> #include <linux/if_arp.h> #include <linux/skbuff.h> #include <linux/slab.h> #include <net/sock.h> #include <net/tcp_states.h> #include <linux/uaccess.h> #include <linux/fcntl.h> #include <linux/termios.h> /* For TIOCINQ/OUTQ */ #include <linux/notifier.h> #include <linux/init.h> #include <linux/compat.h> #include <linux/ctype.h> #include <net/x25.h> #include <net/compat.h> int sysctl_x25_restart_request_timeout = X25_DEFAULT_T20; int sysctl_x25_call_request_timeout = X25_DEFAULT_T21; int sysctl_x25_reset_request_timeout = X25_DEFAULT_T22; int sysctl_x25_clear_request_timeout = X25_DEFAULT_T23; int sysctl_x25_ack_holdback_timeout = X25_DEFAULT_T2; int sysctl_x25_forward = 0; HLIST_HEAD(x25_list); DEFINE_RWLOCK(x25_list_lock); static const struct proto_ops x25_proto_ops; static const struct x25_address null_x25_address = {" "}; #ifdef CONFIG_COMPAT struct compat_x25_subscrip_struct { char device[200-sizeof(compat_ulong_t)]; compat_ulong_t global_facil_mask; compat_uint_t extended; }; #endif int x25_parse_address_block(struct sk_buff *skb, struct x25_address *called_addr, struct x25_address *calling_addr) { unsigned char len; int needed; int rc; if (!pskb_may_pull(skb, 1)) { /* packet has no address block */ rc = 0; goto empty; } len = *skb->data; needed = 1 + ((len >> 4) + (len & 0x0f) + 1) / 2; if (!pskb_may_pull(skb, needed)) { /* packet is too short to hold the addresses it claims to hold */ rc = -1; goto empty; } return x25_addr_ntoa(skb->data, called_addr, calling_addr); empty: *called_addr->x25_addr = 0; *calling_addr->x25_addr = 0; return rc; } int x25_addr_ntoa(unsigned char *p, struct x25_address *called_addr, struct x25_address *calling_addr) { unsigned int called_len, calling_len; char *called, *calling; unsigned int i; called_len = (*p >> 0) & 0x0F; calling_len = (*p >> 4) & 0x0F; called = called_addr->x25_addr; calling = calling_addr->x25_addr; p++; for (i = 0; i < (called_len + calling_len); i++) { if (i < called_len) { if (i % 2 != 0) { *called++ = ((*p >> 0) & 0x0F) + '0'; p++; } else { *called++ = ((*p >> 4) & 0x0F) + '0'; } } else { if (i % 2 != 0) { *calling++ = ((*p >> 0) & 0x0F) + '0'; p++; } else { *calling++ = ((*p >> 4) & 0x0F) + '0'; } } } *called = *calling = '\0'; return 1 + (called_len + calling_len + 1) / 2; } int x25_addr_aton(unsigned char *p, struct x25_address *called_addr, struct x25_address *calling_addr) { unsigned int called_len, calling_len; char *called, *calling; int i; called = called_addr->x25_addr; calling = calling_addr->x25_addr; called_len = strlen(called); calling_len = strlen(calling); *p++ = (calling_len << 4) | (called_len << 0); for (i = 0; i < (called_len + calling_len); i++) { if (i < called_len) { if (i % 2 != 0) { *p |= (*called++ - '0') << 0; p++; } else { *p = 0x00; *p |= (*called++ - '0') << 4; } } else { if (i % 2 != 0) { *p |= (*calling++ - '0') << 0; p++; } else { *p = 0x00; *p |= (*calling++ - '0') << 4; } } } return 1 + (called_len + calling_len + 1) / 2; } /* * Socket removal during an interrupt is now safe. */ static void x25_remove_socket(struct sock *sk) { write_lock_bh(&x25_list_lock); sk_del_node_init(sk); write_unlock_bh(&x25_list_lock); } /* * Handle device status changes. */ static int x25_device_event(struct notifier_block *this, unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct x25_neigh *nb; if (!net_eq(dev_net(dev), &init_net)) return NOTIFY_DONE; if (dev->type == ARPHRD_X25) { switch (event) { case NETDEV_REGISTER: case NETDEV_POST_TYPE_CHANGE: x25_link_device_up(dev); break; case NETDEV_DOWN: nb = x25_get_neigh(dev); if (nb) { x25_link_terminated(nb); x25_neigh_put(nb); } x25_route_device_down(dev); break; case NETDEV_PRE_TYPE_CHANGE: case NETDEV_UNREGISTER: x25_link_device_down(dev); break; case NETDEV_CHANGE: if (!netif_carrier_ok(dev)) { nb = x25_get_neigh(dev); if (nb) { x25_link_terminated(nb); x25_neigh_put(nb); } } break; } } return NOTIFY_DONE; } /* * Add a socket to the bound sockets list. */ static void x25_insert_socket(struct sock *sk) { write_lock_bh(&x25_list_lock); sk_add_node(sk, &x25_list); write_unlock_bh(&x25_list_lock); } /* * Find a socket that wants to accept the Call Request we just * received. Check the full list for an address/cud match. * If no cuds match return the next_best thing, an address match. * Note: if a listening socket has cud set it must only get calls * with matching cud. */ static struct sock *x25_find_listener(struct x25_address *addr, struct sk_buff *skb) { struct sock *s; struct sock *next_best; read_lock_bh(&x25_list_lock); next_best = NULL; sk_for_each(s, &x25_list) if ((!strcmp(addr->x25_addr, x25_sk(s)->source_addr.x25_addr) || !strcmp(x25_sk(s)->source_addr.x25_addr, null_x25_address.x25_addr)) && s->sk_state == TCP_LISTEN) { /* * Found a listening socket, now check the incoming * call user data vs this sockets call user data */ if (x25_sk(s)->cudmatchlength > 0 && skb->len >= x25_sk(s)->cudmatchlength) { if((memcmp(x25_sk(s)->calluserdata.cuddata, skb->data, x25_sk(s)->cudmatchlength)) == 0) { sock_hold(s); goto found; } } else next_best = s; } if (next_best) { s = next_best; sock_hold(s); goto found; } s = NULL; found: read_unlock_bh(&x25_list_lock); return s; } /* * Find a connected X.25 socket given my LCI and neighbour. */ static struct sock *__x25_find_socket(unsigned int lci, struct x25_neigh *nb) { struct sock *s; sk_for_each(s, &x25_list) if (x25_sk(s)->lci == lci && x25_sk(s)->neighbour == nb) { sock_hold(s); goto found; } s = NULL; found: return s; } struct sock *x25_find_socket(unsigned int lci, struct x25_neigh *nb) { struct sock *s; read_lock_bh(&x25_list_lock); s = __x25_find_socket(lci, nb); read_unlock_bh(&x25_list_lock); return s; } /* * Find a unique LCI for a given device. */ static unsigned int x25_new_lci(struct x25_neigh *nb) { unsigned int lci = 1; struct sock *sk; while ((sk = x25_find_socket(lci, nb)) != NULL) { sock_put(sk); if (++lci == 4096) { lci = 0; break; } cond_resched(); } return lci; } /* * Deferred destroy. */ static void __x25_destroy_socket(struct sock *); /* * handler for deferred kills. */ static void x25_destroy_timer(struct timer_list *t) { struct sock *sk = timer_container_of(sk, t, sk_timer); x25_destroy_socket_from_timer(sk); } /* * This is called from user mode and the timers. Thus it protects itself * against interrupting users but doesn't worry about being called during * work. Once it is removed from the queue no interrupt or bottom half * will touch it and we are (fairly 8-) ) safe. * Not static as it's used by the timer */ static void __x25_destroy_socket(struct sock *sk) { struct sk_buff *skb; x25_stop_heartbeat(sk); x25_stop_timer(sk); x25_remove_socket(sk); x25_clear_queues(sk); /* Flush the queues */ while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) { if (skb->sk != sk) { /* A pending connection */ /* * Queue the unaccepted socket for death */ skb->sk->sk_state = TCP_LISTEN; sock_set_flag(skb->sk, SOCK_DEAD); x25_start_heartbeat(skb->sk); x25_sk(skb->sk)->state = X25_STATE_0; } kfree_skb(skb); } if (sk_has_allocations(sk)) { /* Defer: outstanding buffers */ sk->sk_timer.expires = jiffies + 10 * HZ; sk->sk_timer.function = x25_destroy_timer; add_timer(&sk->sk_timer); } else { /* drop last reference so sock_put will free */ __sock_put(sk); } } void x25_destroy_socket_from_timer(struct sock *sk) { sock_hold(sk); bh_lock_sock(sk); __x25_destroy_socket(sk); bh_unlock_sock(sk); sock_put(sk); } /* * Handling for system calls applied via the various interfaces to a * X.25 socket object. */ static int x25_setsockopt(struct socket *sock, int level, int optname, sockptr_t optval, unsigned int optlen) { int opt; struct sock *sk = sock->sk; int rc = -ENOPROTOOPT; if (level != SOL_X25 || optname != X25_QBITINCL) goto out; rc = -EINVAL; if (optlen < sizeof(int)) goto out; rc = -EFAULT; if (copy_from_sockptr(&opt, optval, sizeof(int))) goto out; if (opt) set_bit(X25_Q_BIT_FLAG, &x25_sk(sk)->flags); else clear_bit(X25_Q_BIT_FLAG, &x25_sk(sk)->flags); rc = 0; out: return rc; } static int x25_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen) { struct sock *sk = sock->sk; int val, len, rc = -ENOPROTOOPT; if (level != SOL_X25 || optname != X25_QBITINCL) goto out; rc = -EFAULT; if (get_user(len, optlen)) goto out; rc = -EINVAL; if (len < 0) goto out; len = min_t(unsigned int, len, sizeof(int)); rc = -EFAULT; if (put_user(len, optlen)) goto out; val = test_bit(X25_Q_BIT_FLAG, &x25_sk(sk)->flags); rc = copy_to_user(optval, &val, len) ? -EFAULT : 0; out: return rc; } static int x25_listen(struct socket *sock, int backlog) { struct sock *sk = sock->sk; int rc = -EOPNOTSUPP; lock_sock(sk); if (sock->state != SS_UNCONNECTED) { rc = -EINVAL; release_sock(sk); return rc; } if (sk->sk_state != TCP_LISTEN) { memset(&x25_sk(sk)->dest_addr, 0, X25_ADDR_LEN); sk->sk_max_ack_backlog = backlog; sk->sk_state = TCP_LISTEN; rc = 0; } release_sock(sk); return rc; } static struct proto x25_proto = { .name = "X25", .owner = THIS_MODULE, .obj_size = sizeof(struct x25_sock), }; static struct sock *x25_alloc_socket(struct net *net, int kern) { struct x25_sock *x25; struct sock *sk = sk_alloc(net, AF_X25, GFP_ATOMIC, &x25_proto, kern); if (!sk) goto out; sock_init_data(NULL, sk); x25 = x25_sk(sk); skb_queue_head_init(&x25->ack_queue); skb_queue_head_init(&x25->fragment_queue); skb_queue_head_init(&x25->interrupt_in_queue); skb_queue_head_init(&x25->interrupt_out_queue); out: return sk; } static int x25_create(struct net *net, struct socket *sock, int protocol, int kern) { struct sock *sk; struct x25_sock *x25; int rc = -EAFNOSUPPORT; if (!net_eq(net, &init_net)) goto out; rc = -ESOCKTNOSUPPORT; if (sock->type != SOCK_SEQPACKET) goto out; rc = -EINVAL; if (protocol) goto out; rc = -ENOMEM; if ((sk = x25_alloc_socket(net, kern)) == NULL) goto out; x25 = x25_sk(sk); sock_init_data(sock, sk); x25_init_timers(sk); sock->ops = &x25_proto_ops; sk->sk_protocol = protocol; sk->sk_backlog_rcv = x25_backlog_rcv; x25->t21 = sysctl_x25_call_request_timeout; x25->t22 = sysctl_x25_reset_request_timeout; x25->t23 = sysctl_x25_clear_request_timeout; x25->t2 = sysctl_x25_ack_holdback_timeout; x25->state = X25_STATE_0; x25->cudmatchlength = 0; set_bit(X25_ACCPT_APPRV_FLAG, &x25->flags); /* normally no cud */ /* on call accept */ x25->facilities.winsize_in = X25_DEFAULT_WINDOW_SIZE; x25->facilities.winsize_out = X25_DEFAULT_WINDOW_SIZE; x25->facilities.pacsize_in = X25_DEFAULT_PACKET_SIZE; x25->facilities.pacsize_out = X25_DEFAULT_PACKET_SIZE; x25->facilities.throughput = 0; /* by default don't negotiate throughput */ x25->facilities.reverse = X25_DEFAULT_REVERSE; x25->dte_facilities.calling_len = 0; x25->dte_facilities.called_len = 0; memset(x25->dte_facilities.called_ae, '\0', sizeof(x25->dte_facilities.called_ae)); memset(x25->dte_facilities.calling_ae, '\0', sizeof(x25->dte_facilities.calling_ae)); rc = 0; out: return rc; } static struct sock *x25_make_new(struct sock *osk) { struct sock *sk = NULL; struct x25_sock *x25, *ox25; if (osk->sk_type != SOCK_SEQPACKET) goto out; if ((sk = x25_alloc_socket(sock_net(osk), 0)) == NULL) goto out; x25 = x25_sk(sk); sk->sk_type = osk->sk_type; sk->sk_priority = READ_ONCE(osk->sk_priority); sk->sk_protocol = osk->sk_protocol; sk->sk_rcvbuf = osk->sk_rcvbuf; sk->sk_sndbuf = osk->sk_sndbuf; sk->sk_state = TCP_ESTABLISHED; sk->sk_backlog_rcv = osk->sk_backlog_rcv; sock_copy_flags(sk, osk); ox25 = x25_sk(osk); x25->t21 = ox25->t21; x25->t22 = ox25->t22; x25->t23 = ox25->t23; x25->t2 = ox25->t2; x25->flags = ox25->flags; x25->facilities = ox25->facilities; x25->dte_facilities = ox25->dte_facilities; x25->cudmatchlength = ox25->cudmatchlength; clear_bit(X25_INTERRUPT_FLAG, &x25->flags); x25_init_timers(sk); out: return sk; } static int x25_release(struct socket *sock) { struct sock *sk = sock->sk; struct x25_sock *x25; if (!sk) return 0; x25 = x25_sk(sk); sock_hold(sk); lock_sock(sk); switch (x25->state) { case X25_STATE_0: case X25_STATE_2: x25_disconnect(sk, 0, 0, 0); __x25_destroy_socket(sk); goto out; case X25_STATE_1: case X25_STATE_3: case X25_STATE_4: x25_clear_queues(sk); x25_write_internal(sk, X25_CLEAR_REQUEST); x25_start_t23timer(sk); x25->state = X25_STATE_2; sk->sk_state = TCP_CLOSE; sk->sk_shutdown |= SEND_SHUTDOWN; sk->sk_state_change(sk); sock_set_flag(sk, SOCK_DEAD); sock_set_flag(sk, SOCK_DESTROY); break; case X25_STATE_5: x25_write_internal(sk, X25_CLEAR_REQUEST); x25_disconnect(sk, 0, 0, 0); __x25_destroy_socket(sk); goto out; } sock_orphan(sk); out: release_sock(sk); sock_put(sk); return 0; } static int x25_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) { struct sock *sk = sock->sk; struct sockaddr_x25 *addr = (struct sockaddr_x25 *)uaddr; int len, i, rc = 0; if (addr_len != sizeof(struct sockaddr_x25) || addr->sx25_family != AF_X25 || strnlen(addr->sx25_addr.x25_addr, X25_ADDR_LEN) == X25_ADDR_LEN) { rc = -EINVAL; goto out; } /* check for the null_x25_address */ if (strcmp(addr->sx25_addr.x25_addr, null_x25_address.x25_addr)) { len = strlen(addr->sx25_addr.x25_addr); for (i = 0; i < len; i++) { if (!isdigit(addr->sx25_addr.x25_addr[i])) { rc = -EINVAL; goto out; } } } lock_sock(sk); if (sock_flag(sk, SOCK_ZAPPED)) { x25_sk(sk)->source_addr = addr->sx25_addr; x25_insert_socket(sk); sock_reset_flag(sk, SOCK_ZAPPED); } else { rc = -EINVAL; } release_sock(sk); net_dbg_ratelimited("x25_bind: socket is bound\n"); out: return rc; } static int x25_wait_for_connection_establishment(struct sock *sk) { DECLARE_WAITQUEUE(wait, current); int rc; add_wait_queue_exclusive(sk_sleep(sk), &wait); for (;;) { __set_current_state(TASK_INTERRUPTIBLE); rc = -ERESTARTSYS; if (signal_pending(current)) break; rc = sock_error(sk); if (rc) { sk->sk_socket->state = SS_UNCONNECTED; break; } rc = -ENOTCONN; if (sk->sk_state == TCP_CLOSE) { sk->sk_socket->state = SS_UNCONNECTED; break; } rc = 0; if (sk->sk_state != TCP_ESTABLISHED) { release_sock(sk); schedule(); lock_sock(sk); } else break; } __set_current_state(TASK_RUNNING); remove_wait_queue(sk_sleep(sk), &wait); return rc; } static int x25_connect(struct socket *sock, struct sockaddr *uaddr, int addr_len, int flags) { struct sock *sk = sock->sk; struct x25_sock *x25 = x25_sk(sk); struct sockaddr_x25 *addr = (struct sockaddr_x25 *)uaddr; struct x25_route *rt; int rc = 0; lock_sock(sk); if (sk->sk_state == TCP_ESTABLISHED && sock->state == SS_CONNECTING) { sock->state = SS_CONNECTED; goto out; /* Connect completed during a ERESTARTSYS event */ } rc = -ECONNREFUSED; if (sk->sk_state == TCP_CLOSE && sock->state == SS_CONNECTING) { sock->state = SS_UNCONNECTED; goto out; } rc = -EISCONN; /* No reconnect on a seqpacket socket */ if (sk->sk_state == TCP_ESTABLISHED) goto out; rc = -EALREADY; /* Do nothing if call is already in progress */ if (sk->sk_state == TCP_SYN_SENT) goto out; sk->sk_state = TCP_CLOSE; sock->state = SS_UNCONNECTED; rc = -EINVAL; if (addr_len != sizeof(struct sockaddr_x25) || addr->sx25_family != AF_X25 || strnlen(addr->sx25_addr.x25_addr, X25_ADDR_LEN) == X25_ADDR_LEN) goto out; rc = -ENETUNREACH; rt = x25_get_route(&addr->sx25_addr); if (!rt) goto out; x25->neighbour = x25_get_neigh(rt->dev); if (!x25->neighbour) goto out_put_route; x25_limit_facilities(&x25->facilities, x25->neighbour); x25->lci = x25_new_lci(x25->neighbour); if (!x25->lci) goto out_put_neigh; rc = -EINVAL; if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */ goto out_put_neigh; if (!strcmp(x25->source_addr.x25_addr, null_x25_address.x25_addr)) memset(&x25->source_addr, '\0', X25_ADDR_LEN); x25->dest_addr = addr->sx25_addr; /* Move to connecting socket, start sending Connect Requests */ sock->state = SS_CONNECTING; sk->sk_state = TCP_SYN_SENT; x25->state = X25_STATE_1; x25_write_internal(sk, X25_CALL_REQUEST); x25_start_heartbeat(sk); x25_start_t21timer(sk); /* Now the loop */ rc = -EINPROGRESS; if (sk->sk_state != TCP_ESTABLISHED && (flags & O_NONBLOCK)) goto out; rc = x25_wait_for_connection_establishment(sk); if (rc) goto out_put_neigh; sock->state = SS_CONNECTED; rc = 0; out_put_neigh: if (rc && x25->neighbour) { read_lock_bh(&x25_list_lock); x25_neigh_put(x25->neighbour); x25->neighbour = NULL; read_unlock_bh(&x25_list_lock); x25->state = X25_STATE_0; } out_put_route: x25_route_put(rt); out: release_sock(sk); return rc; } static int x25_wait_for_data(struct sock *sk, long timeout) { DECLARE_WAITQUEUE(wait, current); int rc = 0; add_wait_queue_exclusive(sk_sleep(sk), &wait); for (;;) { __set_current_state(TASK_INTERRUPTIBLE); if (sk->sk_shutdown & RCV_SHUTDOWN) break; rc = -ERESTARTSYS; if (signal_pending(current)) break; rc = -EAGAIN; if (!timeout) break; rc = 0; if (skb_queue_empty(&sk->sk_receive_queue)) { release_sock(sk); timeout = schedule_timeout(timeout); lock_sock(sk); } else break; } __set_current_state(TASK_RUNNING); remove_wait_queue(sk_sleep(sk), &wait); return rc; } static int x25_accept(struct socket *sock, struct socket *newsock, struct proto_accept_arg *arg) { struct sock *sk = sock->sk; struct sock *newsk; struct sk_buff *skb; int rc = -EINVAL; if (!sk) goto out; rc = -EOPNOTSUPP; if (sk->sk_type != SOCK_SEQPACKET) goto out; lock_sock(sk); rc = -EINVAL; if (sk->sk_state != TCP_LISTEN) goto out2; rc = x25_wait_for_data(sk, READ_ONCE(sk->sk_rcvtimeo)); if (rc) goto out2; skb = skb_dequeue(&sk->sk_receive_queue); rc = -EINVAL; if (!skb->sk) goto out2; newsk = skb->sk; sock_graft(newsk, newsock); /* Now attach up the new socket */ skb->sk = NULL; kfree_skb(skb); sk_acceptq_removed(sk); newsock->state = SS_CONNECTED; rc = 0; out2: release_sock(sk); out: return rc; } static int x25_getname(struct socket *sock, struct sockaddr *uaddr, int peer) { struct sockaddr_x25 *sx25 = (struct sockaddr_x25 *)uaddr; struct sock *sk = sock->sk; struct x25_sock *x25 = x25_sk(sk); int rc = 0; if (peer) { if (sk->sk_state != TCP_ESTABLISHED) { rc = -ENOTCONN; goto out; } sx25->sx25_addr = x25->dest_addr; } else sx25->sx25_addr = x25->source_addr; sx25->sx25_family = AF_X25; rc = sizeof(*sx25); out: return rc; } int x25_rx_call_request(struct sk_buff *skb, struct x25_neigh *nb, unsigned int lci) { struct sock *sk; struct sock *make; struct x25_sock *makex25; struct x25_address source_addr, dest_addr; struct x25_facilities facilities; struct x25_dte_facilities dte_facilities; int len, addr_len, rc; /* * Remove the LCI and frame type. */ skb_pull(skb, X25_STD_MIN_LEN); /* * Extract the X.25 addresses and convert them to ASCII strings, * and remove them. * * Address block is mandatory in call request packets */ addr_len = x25_parse_address_block(skb, &source_addr, &dest_addr); if (addr_len <= 0) goto out_clear_request; skb_pull(skb, addr_len); /* * Get the length of the facilities, skip past them for the moment * get the call user data because this is needed to determine * the correct listener * * Facilities length is mandatory in call request packets */ if (!pskb_may_pull(skb, 1)) goto out_clear_request; len = skb->data[0] + 1; if (!pskb_may_pull(skb, len)) goto out_clear_request; skb_pull(skb,len); /* * Ensure that the amount of call user data is valid. */ if (skb->len > X25_MAX_CUD_LEN) goto out_clear_request; /* * Get all the call user data so it can be used in * x25_find_listener and skb_copy_from_linear_data up ahead. */ if (!pskb_may_pull(skb, skb->len)) goto out_clear_request; /* * Find a listener for the particular address/cud pair. */ sk = x25_find_listener(&source_addr,skb); skb_push(skb,len); if (sk != NULL && sk_acceptq_is_full(sk)) { goto out_sock_put; } /* * We dont have any listeners for this incoming call. * Try forwarding it. */ if (sk == NULL) { skb_push(skb, addr_len + X25_STD_MIN_LEN); if (sysctl_x25_forward && x25_forward_call(&dest_addr, nb, skb, lci) > 0) { /* Call was forwarded, dont process it any more */ kfree_skb(skb); rc = 1; goto out; } else { /* No listeners, can't forward, clear the call */ goto out_clear_request; } } /* * Try to reach a compromise on the requested facilities. */ len = x25_negotiate_facilities(skb, sk, &facilities, &dte_facilities); if (len == -1) goto out_sock_put; /* * current neighbour/link might impose additional limits * on certain facilities */ x25_limit_facilities(&facilities, nb); /* * Try to create a new socket. */ make = x25_make_new(sk); if (!make) goto out_sock_put; /* * Remove the facilities */ skb_pull(skb, len); skb->sk = make; make->sk_state = TCP_ESTABLISHED; makex25 = x25_sk(make); makex25->lci = lci; makex25->dest_addr = dest_addr; makex25->source_addr = source_addr; x25_neigh_hold(nb); makex25->neighbour = nb; makex25->facilities = facilities; makex25->dte_facilities= dte_facilities; makex25->vc_facil_mask = x25_sk(sk)->vc_facil_mask; /* ensure no reverse facil on accept */ makex25->vc_facil_mask &= ~X25_MASK_REVERSE; /* ensure no calling address extension on accept */ makex25->vc_facil_mask &= ~X25_MASK_CALLING_AE; makex25->cudmatchlength = x25_sk(sk)->cudmatchlength; /* Normally all calls are accepted immediately */ if (test_bit(X25_ACCPT_APPRV_FLAG, &makex25->flags)) { x25_write_internal(make, X25_CALL_ACCEPTED); makex25->state = X25_STATE_3; } else { makex25->state = X25_STATE_5; } /* * Incoming Call User Data. */ skb_copy_from_linear_data(skb, makex25->calluserdata.cuddata, skb->len); makex25->calluserdata.cudlength = skb->len; sk_acceptq_added(sk); x25_insert_socket(make); skb_queue_head(&sk->sk_receive_queue, skb); x25_start_heartbeat(make); if (!sock_flag(sk, SOCK_DEAD)) sk->sk_data_ready(sk); rc = 1; sock_put(sk); out: return rc; out_sock_put: sock_put(sk); out_clear_request: rc = 0; x25_transmit_clear_request(nb, lci, 0x01); goto out; } static int x25_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; struct x25_sock *x25 = x25_sk(sk); DECLARE_SOCKADDR(struct sockaddr_x25 *, usx25, msg->msg_name); struct sockaddr_x25 sx25; struct sk_buff *skb; unsigned char *asmptr; int noblock = msg->msg_flags & MSG_DONTWAIT; size_t size; int qbit = 0, rc = -EINVAL; lock_sock(sk); if (msg->msg_flags & ~(MSG_DONTWAIT|MSG_OOB|MSG_EOR|MSG_CMSG_COMPAT)) goto out; /* we currently don't support segmented records at the user interface */ if (!(msg->msg_flags & (MSG_EOR|MSG_OOB))) goto out; rc = -EADDRNOTAVAIL; if (sock_flag(sk, SOCK_ZAPPED)) goto out; rc = -EPIPE; if (sk->sk_shutdown & SEND_SHUTDOWN) { send_sig(SIGPIPE, current, 0); goto out; } rc = -ENETUNREACH; if (!x25->neighbour) goto out; if (usx25) { rc = -EINVAL; if (msg->msg_namelen < sizeof(sx25)) goto out; memcpy(&sx25, usx25, sizeof(sx25)); rc = -EISCONN; if (strcmp(x25->dest_addr.x25_addr, sx25.sx25_addr.x25_addr)) goto out; rc = -EINVAL; if (sx25.sx25_family != AF_X25) goto out; } else { /* * FIXME 1003.1g - if the socket is like this because * it has become closed (not started closed) we ought * to SIGPIPE, EPIPE; */ rc = -ENOTCONN; if (sk->sk_state != TCP_ESTABLISHED) goto out; sx25.sx25_family = AF_X25; sx25.sx25_addr = x25->dest_addr; } /* Sanity check the packet size */ if (len > 65535) { rc = -EMSGSIZE; goto out; } net_dbg_ratelimited("x25_sendmsg: sendto: Addresses built.\n"); /* Build a packet */ net_dbg_ratelimited("x25_sendmsg: sendto: building packet.\n"); if ((msg->msg_flags & MSG_OOB) && len > 32) len = 32; size = len + X25_MAX_L2_LEN + X25_EXT_MIN_LEN; release_sock(sk); skb = sock_alloc_send_skb(sk, size, noblock, &rc); lock_sock(sk); if (!skb) goto out; X25_SKB_CB(skb)->flags = msg->msg_flags; skb_reserve(skb, X25_MAX_L2_LEN + X25_EXT_MIN_LEN); /* * Put the data on the end */ net_dbg_ratelimited("x25_sendmsg: Copying user data\n"); skb_reset_transport_header(skb); skb_put(skb, len); rc = memcpy_from_msg(skb_transport_header(skb), msg, len); if (rc) goto out_kfree_skb; /* * If the Q BIT Include socket option is in force, the first * byte of the user data is the logical value of the Q Bit. */ if (test_bit(X25_Q_BIT_FLAG, &x25->flags)) { if (!pskb_may_pull(skb, 1)) goto out_kfree_skb; qbit = skb->data[0]; skb_pull(skb, 1); } /* * Push down the X.25 header */ net_dbg_ratelimited("x25_sendmsg: Building X.25 Header.\n"); if (msg->msg_flags & MSG_OOB) { if (x25->neighbour->extended) { asmptr = skb_push(skb, X25_STD_MIN_LEN); *asmptr++ = ((x25->lci >> 8) & 0x0F) | X25_GFI_EXTSEQ; *asmptr++ = (x25->lci >> 0) & 0xFF; *asmptr++ = X25_INTERRUPT; } else { asmptr = skb_push(skb, X25_STD_MIN_LEN); *asmptr++ = ((x25->lci >> 8) & 0x0F) | X25_GFI_STDSEQ; *asmptr++ = (x25->lci >> 0) & 0xFF; *asmptr++ = X25_INTERRUPT; } } else { if (x25->neighbour->extended) { /* Build an Extended X.25 header */ asmptr = skb_push(skb, X25_EXT_MIN_LEN); *asmptr++ = ((x25->lci >> 8) & 0x0F) | X25_GFI_EXTSEQ; *asmptr++ = (x25->lci >> 0) & 0xFF; *asmptr++ = X25_DATA; *asmptr++ = X25_DATA; } else { /* Build an Standard X.25 header */ asmptr = skb_push(skb, X25_STD_MIN_LEN); *asmptr++ = ((x25->lci >> 8) & 0x0F) | X25_GFI_STDSEQ; *asmptr++ = (x25->lci >> 0) & 0xFF; *asmptr++ = X25_DATA; } if (qbit) skb->data[0] |= X25_Q_BIT; } net_dbg_ratelimited("x25_sendmsg: Built header.\n"); net_dbg_ratelimited("x25_sendmsg: Transmitting buffer\n"); rc = -ENOTCONN; if (sk->sk_state != TCP_ESTABLISHED) goto out_kfree_skb; if (msg->msg_flags & MSG_OOB) skb_queue_tail(&x25->interrupt_out_queue, skb); else { rc = x25_output(sk, skb); len = rc; if (rc < 0) kfree_skb(skb); else if (test_bit(X25_Q_BIT_FLAG, &x25->flags)) len++; } x25_kick(sk); rc = len; out: release_sock(sk); return rc; out_kfree_skb: kfree_skb(skb); goto out; } static int x25_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) { struct sock *sk = sock->sk; struct x25_sock *x25 = x25_sk(sk); DECLARE_SOCKADDR(struct sockaddr_x25 *, sx25, msg->msg_name); size_t copied; int qbit, header_len; struct sk_buff *skb; unsigned char *asmptr; int rc = -ENOTCONN; lock_sock(sk); if (x25->neighbour == NULL) goto out; header_len = x25->neighbour->extended ? X25_EXT_MIN_LEN : X25_STD_MIN_LEN; /* * This works for seqpacket too. The receiver has ordered the queue for * us! We do one quick check first though */ if (sk->sk_state != TCP_ESTABLISHED) goto out; if (flags & MSG_OOB) { rc = -EINVAL; if (sock_flag(sk, SOCK_URGINLINE) || !skb_peek(&x25->interrupt_in_queue)) goto out; skb = skb_dequeue(&x25->interrupt_in_queue); if (!pskb_may_pull(skb, X25_STD_MIN_LEN)) goto out_free_dgram; skb_pull(skb, X25_STD_MIN_LEN); /* * No Q bit information on Interrupt data. */ if (test_bit(X25_Q_BIT_FLAG, &x25->flags)) { asmptr = skb_push(skb, 1); *asmptr = 0x00; } msg->msg_flags |= MSG_OOB; } else { /* Now we can treat all alike */ release_sock(sk); skb = skb_recv_datagram(sk, flags, &rc); lock_sock(sk); if (!skb) goto out; if (!pskb_may_pull(skb, header_len)) goto out_free_dgram; qbit = (skb->data[0] & X25_Q_BIT) == X25_Q_BIT; skb_pull(skb, header_len); if (test_bit(X25_Q_BIT_FLAG, &x25->flags)) { asmptr = skb_push(skb, 1); *asmptr = qbit; } } skb_reset_transport_header(skb); copied = skb->len; if (copied > size) { copied = size; msg->msg_flags |= MSG_TRUNC; } /* Currently, each datagram always contains a complete record */ msg->msg_flags |= MSG_EOR; rc = skb_copy_datagram_msg(skb, 0, msg, copied); if (rc) goto out_free_dgram; if (sx25) { sx25->sx25_family = AF_X25; sx25->sx25_addr = x25->dest_addr; msg->msg_namelen = sizeof(*sx25); } x25_check_rbuf(sk); rc = copied; out_free_dgram: skb_free_datagram(sk, skb); out: release_sock(sk); return rc; } static int x25_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { struct sock *sk = sock->sk; struct x25_sock *x25 = x25_sk(sk); void __user *argp = (void __user *)arg; int rc; switch (cmd) { case TIOCOUTQ: { int amount; amount = sk->sk_sndbuf - sk_wmem_alloc_get(sk); if (amount < 0) amount = 0; rc = put_user(amount, (unsigned int __user *)argp); break; } case TIOCINQ: { struct sk_buff *skb; int amount = 0; /* * These two are safe on a single CPU system as * only user tasks fiddle here */ lock_sock(sk); if ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) amount = skb->len; release_sock(sk); rc = put_user(amount, (unsigned int __user *)argp); break; } case SIOCGIFADDR: case SIOCSIFADDR: case SIOCGIFDSTADDR: case SIOCSIFDSTADDR: case SIOCGIFBRDADDR: case SIOCSIFBRDADDR: case SIOCGIFNETMASK: case SIOCSIFNETMASK: case SIOCGIFMETRIC: case SIOCSIFMETRIC: rc = -EINVAL; break; case SIOCADDRT: case SIOCDELRT: rc = -EPERM; if (!capable(CAP_NET_ADMIN)) break; rc = x25_route_ioctl(cmd, argp); break; case SIOCX25GSUBSCRIP: rc = x25_subscr_ioctl(cmd, argp); break; case SIOCX25SSUBSCRIP: rc = -EPERM; if (!capable(CAP_NET_ADMIN)) break; rc = x25_subscr_ioctl(cmd, argp); break; case SIOCX25GFACILITIES: { lock_sock(sk); rc = copy_to_user(argp, &x25->facilities, sizeof(x25->facilities)) ? -EFAULT : 0; release_sock(sk); break; } case SIOCX25SFACILITIES: { struct x25_facilities facilities; rc = -EFAULT; if (copy_from_user(&facilities, argp, sizeof(facilities))) break; rc = -EINVAL; lock_sock(sk); if (sk->sk_state != TCP_LISTEN && sk->sk_state != TCP_CLOSE) goto out_fac_release; if (facilities.pacsize_in < X25_PS16 || facilities.pacsize_in > X25_PS4096) goto out_fac_release; if (facilities.pacsize_out < X25_PS16 || facilities.pacsize_out > X25_PS4096) goto out_fac_release; if (facilities.winsize_in < 1 || facilities.winsize_in > 127) goto out_fac_release; if (facilities.throughput) { int out = facilities.throughput & 0xf0; int in = facilities.throughput & 0x0f; if (!out) facilities.throughput |= X25_DEFAULT_THROUGHPUT << 4; else if (out < 0x30 || out > 0xD0) goto out_fac_release; if (!in) facilities.throughput |= X25_DEFAULT_THROUGHPUT; else if (in < 0x03 || in > 0x0D) goto out_fac_release; } if (facilities.reverse && (facilities.reverse & 0x81) != 0x81) goto out_fac_release; x25->facilities = facilities; rc = 0; out_fac_release: release_sock(sk); break; } case SIOCX25GDTEFACILITIES: { lock_sock(sk); rc = copy_to_user(argp, &x25->dte_facilities, sizeof(x25->dte_facilities)); release_sock(sk); if (rc) rc = -EFAULT; break; } case SIOCX25SDTEFACILITIES: { struct x25_dte_facilities dtefacs; rc = -EFAULT; if (copy_from_user(&dtefacs, argp, sizeof(dtefacs))) break; rc = -EINVAL; lock_sock(sk); if (sk->sk_state != TCP_LISTEN && sk->sk_state != TCP_CLOSE) goto out_dtefac_release; if (dtefacs.calling_len > X25_MAX_AE_LEN) goto out_dtefac_release; if (dtefacs.called_len > X25_MAX_AE_LEN) goto out_dtefac_release; x25->dte_facilities = dtefacs; rc = 0; out_dtefac_release: release_sock(sk); break; } case SIOCX25GCALLUSERDATA: { lock_sock(sk); rc = copy_to_user(argp, &x25->calluserdata, sizeof(x25->calluserdata)) ? -EFAULT : 0; release_sock(sk); break; } case SIOCX25SCALLUSERDATA: { struct x25_calluserdata calluserdata; rc = -EFAULT; if (copy_from_user(&calluserdata, argp, sizeof(calluserdata))) break; rc = -EINVAL; if (calluserdata.cudlength > X25_MAX_CUD_LEN) break; lock_sock(sk); x25->calluserdata = calluserdata; release_sock(sk); rc = 0; break; } case SIOCX25GCAUSEDIAG: { lock_sock(sk); rc = copy_to_user(argp, &x25->causediag, sizeof(x25->causediag)) ? -EFAULT : 0; release_sock(sk); break; } case SIOCX25SCAUSEDIAG: { struct x25_causediag causediag; rc = -EFAULT; if (copy_from_user(&causediag, argp, sizeof(causediag))) break; lock_sock(sk); x25->causediag = causediag; release_sock(sk); rc = 0; break; } case SIOCX25SCUDMATCHLEN: { struct x25_subaddr sub_addr; rc = -EINVAL; lock_sock(sk); if(sk->sk_state != TCP_CLOSE) goto out_cud_release; rc = -EFAULT; if (copy_from_user(&sub_addr, argp, sizeof(sub_addr))) goto out_cud_release; rc = -EINVAL; if (sub_addr.cudmatchlength > X25_MAX_CUD_LEN) goto out_cud_release; x25->cudmatchlength = sub_addr.cudmatchlength; rc = 0; out_cud_release: release_sock(sk); break; } case SIOCX25CALLACCPTAPPRV: { rc = -EINVAL; lock_sock(sk); if (sk->sk_state == TCP_CLOSE) { clear_bit(X25_ACCPT_APPRV_FLAG, &x25->flags); rc = 0; } release_sock(sk); break; } case SIOCX25SENDCALLACCPT: { rc = -EINVAL; lock_sock(sk); if (sk->sk_state != TCP_ESTABLISHED) goto out_sendcallaccpt_release; /* must call accptapprv above */ if (test_bit(X25_ACCPT_APPRV_FLAG, &x25->flags)) goto out_sendcallaccpt_release; x25_write_internal(sk, X25_CALL_ACCEPTED); x25->state = X25_STATE_3; rc = 0; out_sendcallaccpt_release: release_sock(sk); break; } default: rc = -ENOIOCTLCMD; break; } return rc; } static const struct net_proto_family x25_family_ops = { .family = AF_X25, .create = x25_create, .owner = THIS_MODULE, }; #ifdef CONFIG_COMPAT static int compat_x25_subscr_ioctl(unsigned int cmd, struct compat_x25_subscrip_struct __user *x25_subscr32) { struct compat_x25_subscrip_struct x25_subscr; struct x25_neigh *nb; struct net_device *dev; int rc = -EINVAL; rc = -EFAULT; if (copy_from_user(&x25_subscr, x25_subscr32, sizeof(*x25_subscr32))) goto out; rc = -EINVAL; dev = x25_dev_get(x25_subscr.device); if (dev == NULL) goto out; nb = x25_get_neigh(dev); if (nb == NULL) goto out_dev_put; dev_put(dev); if (cmd == SIOCX25GSUBSCRIP) { read_lock_bh(&x25_neigh_list_lock); x25_subscr.extended = nb->extended; x25_subscr.global_facil_mask = nb->global_facil_mask; read_unlock_bh(&x25_neigh_list_lock); rc = copy_to_user(x25_subscr32, &x25_subscr, sizeof(*x25_subscr32)) ? -EFAULT : 0; } else { rc = -EINVAL; if (x25_subscr.extended == 0 || x25_subscr.extended == 1) { rc = 0; write_lock_bh(&x25_neigh_list_lock); nb->extended = x25_subscr.extended; nb->global_facil_mask = x25_subscr.global_facil_mask; write_unlock_bh(&x25_neigh_list_lock); } } x25_neigh_put(nb); out: return rc; out_dev_put: dev_put(dev); goto out; } static int compat_x25_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { void __user *argp = compat_ptr(arg); int rc = -ENOIOCTLCMD; switch(cmd) { case TIOCOUTQ: case TIOCINQ: rc = x25_ioctl(sock, cmd, (unsigned long)argp); break; case SIOCGIFADDR: case SIOCSIFADDR: case SIOCGIFDSTADDR: case SIOCSIFDSTADDR: case SIOCGIFBRDADDR: case SIOCSIFBRDADDR: case SIOCGIFNETMASK: case SIOCSIFNETMASK: case SIOCGIFMETRIC: case SIOCSIFMETRIC: rc = -EINVAL; break; case SIOCADDRT: case SIOCDELRT: rc = -EPERM; if (!capable(CAP_NET_ADMIN)) break; rc = x25_route_ioctl(cmd, argp); break; case SIOCX25GSUBSCRIP: rc = compat_x25_subscr_ioctl(cmd, argp); break; case SIOCX25SSUBSCRIP: rc = -EPERM; if (!capable(CAP_NET_ADMIN)) break; rc = compat_x25_subscr_ioctl(cmd, argp); break; case SIOCX25GFACILITIES: case SIOCX25SFACILITIES: case SIOCX25GDTEFACILITIES: case SIOCX25SDTEFACILITIES: case SIOCX25GCALLUSERDATA: case SIOCX25SCALLUSERDATA: case SIOCX25GCAUSEDIAG: case SIOCX25SCAUSEDIAG: case SIOCX25SCUDMATCHLEN: case SIOCX25CALLACCPTAPPRV: case SIOCX25SENDCALLACCPT: rc = x25_ioctl(sock, cmd, (unsigned long)argp); break; default: rc = -ENOIOCTLCMD; break; } return rc; } #endif static const struct proto_ops x25_proto_ops = { .family = AF_X25, .owner = THIS_MODULE, .release = x25_release, .bind = x25_bind, .connect = x25_connect, .socketpair = sock_no_socketpair, .accept = x25_accept, .getname = x25_getname, .poll = datagram_poll, .ioctl = x25_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = compat_x25_ioctl, #endif .gettstamp = sock_gettstamp, .listen = x25_listen, .shutdown = sock_no_shutdown, .setsockopt = x25_setsockopt, .getsockopt = x25_getsockopt, .sendmsg = x25_sendmsg, .recvmsg = x25_recvmsg, .mmap = sock_no_mmap, }; static struct packet_type x25_packet_type __read_mostly = { .type = cpu_to_be16(ETH_P_X25), .func = x25_lapb_receive_frame, }; static struct notifier_block x25_dev_notifier = { .notifier_call = x25_device_event, }; void x25_kill_by_neigh(struct x25_neigh *nb) { struct sock *s; write_lock_bh(&x25_list_lock); sk_for_each(s, &x25_list) { if (x25_sk(s)->neighbour == nb) { write_unlock_bh(&x25_list_lock); lock_sock(s); x25_disconnect(s, ENETUNREACH, 0, 0); release_sock(s); write_lock_bh(&x25_list_lock); } } write_unlock_bh(&x25_list_lock); /* Remove any related forwards */ x25_clear_forward_by_dev(nb->dev); } static int __init x25_init(void) { int rc; rc = proto_register(&x25_proto, 0); if (rc) goto out; rc = sock_register(&x25_family_ops); if (rc) goto out_proto; dev_add_pack(&x25_packet_type); rc = register_netdevice_notifier(&x25_dev_notifier); if (rc) goto out_sock; rc = x25_register_sysctl(); if (rc) goto out_dev; rc = x25_proc_init(); if (rc) goto out_sysctl; pr_info("Linux Version 0.2\n"); out: return rc; out_sysctl: x25_unregister_sysctl(); out_dev: unregister_netdevice_notifier(&x25_dev_notifier); out_sock: dev_remove_pack(&x25_packet_type); sock_unregister(AF_X25); out_proto: proto_unregister(&x25_proto); goto out; } module_init(x25_init); static void __exit x25_exit(void) { x25_proc_exit(); x25_link_free(); x25_route_free(); x25_unregister_sysctl(); unregister_netdevice_notifier(&x25_dev_notifier); dev_remove_pack(&x25_packet_type); sock_unregister(AF_X25); proto_unregister(&x25_proto); } module_exit(x25_exit); MODULE_AUTHOR("Jonathan Naylor <g4klx@g4klx.demon.co.uk>"); MODULE_DESCRIPTION("The X.25 Packet Layer network layer protocol"); MODULE_LICENSE("GPL"); MODULE_ALIAS_NETPROTO(PF_X25); |
| 98 584 584 1 580 37 579 6 576 584 581 19 19 19 18 292 24 279 278 280 79 78 79 79 79 79 79 78 292 293 292 294 292 79 292 14 292 291 84 1144 83 84 84 83 78 78 77 78 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 | // SPDX-License-Identifier: GPL-2.0-only #define pr_fmt(fmt) "%s: " fmt, __func__ #include <linux/kernel.h> #include <linux/sched.h> #include <linux/wait.h> #include <linux/slab.h> #include <linux/mm.h> #include <linux/percpu-refcount.h> /* * Initially, a percpu refcount is just a set of percpu counters. Initially, we * don't try to detect the ref hitting 0 - which means that get/put can just * increment or decrement the local counter. Note that the counter on a * particular cpu can (and will) wrap - this is fine, when we go to shutdown the * percpu counters will all sum to the correct value * * (More precisely: because modular arithmetic is commutative the sum of all the * percpu_count vars will be equal to what it would have been if all the gets * and puts were done to a single integer, even if some of the percpu integers * overflow or underflow). * * The real trick to implementing percpu refcounts is shutdown. We can't detect * the ref hitting 0 on every put - this would require global synchronization * and defeat the whole purpose of using percpu refs. * * What we do is require the user to keep track of the initial refcount; we know * the ref can't hit 0 before the user drops the initial ref, so as long as we * convert to non percpu mode before the initial ref is dropped everything * works. * * Converting to non percpu mode is done with some RCUish stuff in * percpu_ref_kill. Additionally, we need a bias value so that the * atomic_long_t can't hit 0 before we've added up all the percpu refs. */ #define PERCPU_COUNT_BIAS (1LU << (BITS_PER_LONG - 1)) static DEFINE_SPINLOCK(percpu_ref_switch_lock); static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq); static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref) { return (unsigned long __percpu *) (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD); } /** * percpu_ref_init - initialize a percpu refcount * @ref: percpu_ref to initialize * @release: function which will be called when refcount hits 0 * @flags: PERCPU_REF_INIT_* flags * @gfp: allocation mask to use * * Initializes @ref. @ref starts out in percpu mode with a refcount of 1 unless * @flags contains PERCPU_REF_INIT_ATOMIC or PERCPU_REF_INIT_DEAD. These flags * change the start state to atomic with the latter setting the initial refcount * to 0. See the definitions of PERCPU_REF_INIT_* flags for flag behaviors. * * Note that @release must not sleep - it may potentially be called from RCU * callback context by percpu_ref_kill(). */ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release, unsigned int flags, gfp_t gfp) { size_t align = max_t(size_t, 1 << __PERCPU_REF_FLAG_BITS, __alignof__(unsigned long)); unsigned long start_count = 0; struct percpu_ref_data *data; ref->percpu_count_ptr = (unsigned long) __alloc_percpu_gfp(sizeof(unsigned long), align, gfp); if (!ref->percpu_count_ptr) return -ENOMEM; data = kzalloc(sizeof(*ref->data), gfp); if (!data) { free_percpu((void __percpu *)ref->percpu_count_ptr); ref->percpu_count_ptr = 0; return -ENOMEM; } data->force_atomic = flags & PERCPU_REF_INIT_ATOMIC; data->allow_reinit = flags & PERCPU_REF_ALLOW_REINIT; if (flags & (PERCPU_REF_INIT_ATOMIC | PERCPU_REF_INIT_DEAD)) { ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC; data->allow_reinit = true; } else { start_count += PERCPU_COUNT_BIAS; } if (flags & PERCPU_REF_INIT_DEAD) ref->percpu_count_ptr |= __PERCPU_REF_DEAD; else start_count++; atomic_long_set(&data->count, start_count); data->release = release; data->confirm_switch = NULL; data->ref = ref; ref->data = data; return 0; } EXPORT_SYMBOL_GPL(percpu_ref_init); static void __percpu_ref_exit(struct percpu_ref *ref) { unsigned long __percpu *percpu_count = percpu_count_ptr(ref); if (percpu_count) { /* non-NULL confirm_switch indicates switching in progress */ WARN_ON_ONCE(ref->data && ref->data->confirm_switch); free_percpu(percpu_count); ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD; } } /** * percpu_ref_exit - undo percpu_ref_init() * @ref: percpu_ref to exit * * This function exits @ref. The caller is responsible for ensuring that * @ref is no longer in active use. The usual places to invoke this * function from are the @ref->release() callback or in init failure path * where percpu_ref_init() succeeded but other parts of the initialization * of the embedding object failed. */ void percpu_ref_exit(struct percpu_ref *ref) { struct percpu_ref_data *data = ref->data; unsigned long flags; __percpu_ref_exit(ref); if (!data) return; spin_lock_irqsave(&percpu_ref_switch_lock, flags); ref->percpu_count_ptr |= atomic_long_read(&ref->data->count) << __PERCPU_REF_FLAG_BITS; ref->data = NULL; spin_unlock_irqrestore(&percpu_ref_switch_lock, flags); kfree(data); } EXPORT_SYMBOL_GPL(percpu_ref_exit); static void percpu_ref_call_confirm_rcu(struct rcu_head *rcu) { struct percpu_ref_data *data = container_of(rcu, struct percpu_ref_data, rcu); struct percpu_ref *ref = data->ref; data->confirm_switch(ref); data->confirm_switch = NULL; wake_up_all(&percpu_ref_switch_waitq); if (!data->allow_reinit) __percpu_ref_exit(ref); /* drop ref from percpu_ref_switch_to_atomic() */ percpu_ref_put(ref); } static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) { struct percpu_ref_data *data = container_of(rcu, struct percpu_ref_data, rcu); struct percpu_ref *ref = data->ref; unsigned long __percpu *percpu_count = percpu_count_ptr(ref); static atomic_t underflows; unsigned long count = 0; int cpu; for_each_possible_cpu(cpu) count += *per_cpu_ptr(percpu_count, cpu); pr_debug("global %lu percpu %lu\n", atomic_long_read(&data->count), count); /* * It's crucial that we sum the percpu counters _before_ adding the sum * to &ref->count; since gets could be happening on one cpu while puts * happen on another, adding a single cpu's count could cause * @ref->count to hit 0 before we've got a consistent value - but the * sum of all the counts will be consistent and correct. * * Subtracting the bias value then has to happen _after_ adding count to * &ref->count; we need the bias value to prevent &ref->count from * reaching 0 before we add the percpu counts. But doing it at the same * time is equivalent and saves us atomic operations: */ atomic_long_add((long)count - PERCPU_COUNT_BIAS, &data->count); if (WARN_ONCE(atomic_long_read(&data->count) <= 0, "percpu ref (%ps) <= 0 (%ld) after switching to atomic", data->release, atomic_long_read(&data->count)) && atomic_inc_return(&underflows) < 4) { pr_err("%s(): percpu_ref underflow", __func__); mem_dump_obj(data); } /* @ref is viewed as dead on all CPUs, send out switch confirmation */ percpu_ref_call_confirm_rcu(rcu); } static void percpu_ref_noop_confirm_switch(struct percpu_ref *ref) { } static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref, percpu_ref_func_t *confirm_switch) { if (ref->percpu_count_ptr & __PERCPU_REF_ATOMIC) { if (confirm_switch) confirm_switch(ref); return; } /* switching from percpu to atomic */ ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC; /* * Non-NULL ->confirm_switch is used to indicate that switching is * in progress. Use noop one if unspecified. */ ref->data->confirm_switch = confirm_switch ?: percpu_ref_noop_confirm_switch; percpu_ref_get(ref); /* put after confirmation */ call_rcu_hurry(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu); } static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref) { unsigned long __percpu *percpu_count = percpu_count_ptr(ref); int cpu; BUG_ON(!percpu_count); if (!(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC)) return; if (WARN_ON_ONCE(!ref->data->allow_reinit)) return; atomic_long_add(PERCPU_COUNT_BIAS, &ref->data->count); /* * Restore per-cpu operation. smp_store_release() is paired * with READ_ONCE() in __ref_is_percpu() and guarantees that the * zeroing is visible to all percpu accesses which can see the * following __PERCPU_REF_ATOMIC clearing. */ for_each_possible_cpu(cpu) *per_cpu_ptr(percpu_count, cpu) = 0; smp_store_release(&ref->percpu_count_ptr, ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC); } static void __percpu_ref_switch_mode(struct percpu_ref *ref, percpu_ref_func_t *confirm_switch) { struct percpu_ref_data *data = ref->data; lockdep_assert_held(&percpu_ref_switch_lock); /* * If the previous ATOMIC switching hasn't finished yet, wait for * its completion. If the caller ensures that ATOMIC switching * isn't in progress, this function can be called from any context. */ wait_event_lock_irq(percpu_ref_switch_waitq, !data->confirm_switch, percpu_ref_switch_lock); if (data->force_atomic || percpu_ref_is_dying(ref)) __percpu_ref_switch_to_atomic(ref, confirm_switch); else __percpu_ref_switch_to_percpu(ref); } /** * percpu_ref_switch_to_atomic - switch a percpu_ref to atomic mode * @ref: percpu_ref to switch to atomic mode * @confirm_switch: optional confirmation callback * * There's no reason to use this function for the usual reference counting. * Use percpu_ref_kill[_and_confirm](). * * Schedule switching of @ref to atomic mode. All its percpu counts will * be collected to the main atomic counter. On completion, when all CPUs * are guaraneed to be in atomic mode, @confirm_switch, which may not * block, is invoked. This function may be invoked concurrently with all * the get/put operations and can safely be mixed with kill and reinit * operations. Note that @ref will stay in atomic mode across kill/reinit * cycles until percpu_ref_switch_to_percpu() is called. * * This function may block if @ref is in the process of switching to atomic * mode. If the caller ensures that @ref is not in the process of * switching to atomic mode, this function can be called from any context. */ void percpu_ref_switch_to_atomic(struct percpu_ref *ref, percpu_ref_func_t *confirm_switch) { unsigned long flags; spin_lock_irqsave(&percpu_ref_switch_lock, flags); ref->data->force_atomic = true; __percpu_ref_switch_mode(ref, confirm_switch); spin_unlock_irqrestore(&percpu_ref_switch_lock, flags); } EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic); /** * percpu_ref_switch_to_atomic_sync - switch a percpu_ref to atomic mode * @ref: percpu_ref to switch to atomic mode * * Schedule switching the ref to atomic mode, and wait for the * switch to complete. Caller must ensure that no other thread * will switch back to percpu mode. */ void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref) { percpu_ref_switch_to_atomic(ref, NULL); wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch); } EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync); /** * percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode * @ref: percpu_ref to switch to percpu mode * * There's no reason to use this function for the usual reference counting. * To re-use an expired ref, use percpu_ref_reinit(). * * Switch @ref to percpu mode. This function may be invoked concurrently * with all the get/put operations and can safely be mixed with kill and * reinit operations. This function reverses the sticky atomic state set * by PERCPU_REF_INIT_ATOMIC or percpu_ref_switch_to_atomic(). If @ref is * dying or dead, the actual switching takes place on the following * percpu_ref_reinit(). * * This function may block if @ref is in the process of switching to atomic * mode. If the caller ensures that @ref is not in the process of * switching to atomic mode, this function can be called from any context. */ void percpu_ref_switch_to_percpu(struct percpu_ref *ref) { unsigned long flags; spin_lock_irqsave(&percpu_ref_switch_lock, flags); ref->data->force_atomic = false; __percpu_ref_switch_mode(ref, NULL); spin_unlock_irqrestore(&percpu_ref_switch_lock, flags); } EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu); /** * percpu_ref_kill_and_confirm - drop the initial ref and schedule confirmation * @ref: percpu_ref to kill * @confirm_kill: optional confirmation callback * * Equivalent to percpu_ref_kill() but also schedules kill confirmation if * @confirm_kill is not NULL. @confirm_kill, which may not block, will be * called after @ref is seen as dead from all CPUs at which point all * further invocations of percpu_ref_tryget_live() will fail. See * percpu_ref_tryget_live() for details. * * This function normally doesn't block and can be called from any context * but it may block if @confirm_kill is specified and @ref is in the * process of switching to atomic mode by percpu_ref_switch_to_atomic(). * * There are no implied RCU grace periods between kill and release. */ void percpu_ref_kill_and_confirm(struct percpu_ref *ref, percpu_ref_func_t *confirm_kill) { unsigned long flags; spin_lock_irqsave(&percpu_ref_switch_lock, flags); WARN_ONCE(percpu_ref_is_dying(ref), "%s called more than once on %ps!", __func__, ref->data->release); ref->percpu_count_ptr |= __PERCPU_REF_DEAD; __percpu_ref_switch_mode(ref, confirm_kill); percpu_ref_put(ref); spin_unlock_irqrestore(&percpu_ref_switch_lock, flags); } EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm); /** * percpu_ref_is_zero - test whether a percpu refcount reached zero * @ref: percpu_ref to test * * Returns %true if @ref reached zero. * * This function is safe to call as long as @ref is between init and exit. */ bool percpu_ref_is_zero(struct percpu_ref *ref) { unsigned long __percpu *percpu_count; unsigned long count, flags; if (__ref_is_percpu(ref, &percpu_count)) return false; /* protect us from being destroyed */ spin_lock_irqsave(&percpu_ref_switch_lock, flags); if (ref->data) count = atomic_long_read(&ref->data->count); else count = ref->percpu_count_ptr >> __PERCPU_REF_FLAG_BITS; spin_unlock_irqrestore(&percpu_ref_switch_lock, flags); return count == 0; } EXPORT_SYMBOL_GPL(percpu_ref_is_zero); /** * percpu_ref_reinit - re-initialize a percpu refcount * @ref: perpcu_ref to re-initialize * * Re-initialize @ref so that it's in the same state as when it finished * percpu_ref_init() ignoring %PERCPU_REF_INIT_DEAD. @ref must have been * initialized successfully and reached 0 but not exited. * * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while * this function is in progress. */ void percpu_ref_reinit(struct percpu_ref *ref) { WARN_ON_ONCE(!percpu_ref_is_zero(ref)); percpu_ref_resurrect(ref); } EXPORT_SYMBOL_GPL(percpu_ref_reinit); /** * percpu_ref_resurrect - modify a percpu refcount from dead to live * @ref: perpcu_ref to resurrect * * Modify @ref so that it's in the same state as before percpu_ref_kill() was * called. @ref must be dead but must not yet have exited. * * If @ref->release() frees @ref then the caller is responsible for * guaranteeing that @ref->release() does not get called while this * function is in progress. * * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while * this function is in progress. */ void percpu_ref_resurrect(struct percpu_ref *ref) { unsigned long __percpu *percpu_count; unsigned long flags; spin_lock_irqsave(&percpu_ref_switch_lock, flags); WARN_ON_ONCE(!percpu_ref_is_dying(ref)); WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count)); ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD; percpu_ref_get(ref); __percpu_ref_switch_mode(ref, NULL); spin_unlock_irqrestore(&percpu_ref_switch_lock, flags); } EXPORT_SYMBOL_GPL(percpu_ref_resurrect); |
| 30 20 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 | /* SPDX-License-Identifier: GPL-2.0 * * FUSE: Filesystem in Userspace * Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu> */ #ifndef _FS_FUSE_DEV_I_H #define _FS_FUSE_DEV_I_H #include <linux/types.h> /* Ordinary requests have even IDs, while interrupts IDs are odd */ #define FUSE_INT_REQ_BIT (1ULL << 0) #define FUSE_REQ_ID_STEP (1ULL << 1) extern struct wait_queue_head fuse_dev_waitq; struct fuse_arg; struct fuse_args; struct fuse_pqueue; struct fuse_req; struct fuse_iqueue; struct fuse_forget_link; struct fuse_copy_state { struct fuse_req *req; struct iov_iter *iter; struct pipe_buffer *pipebufs; struct pipe_buffer *currbuf; struct pipe_inode_info *pipe; unsigned long nr_segs; struct page *pg; unsigned int len; unsigned int offset; bool write:1; bool move_folios:1; bool is_uring:1; struct { unsigned int copied_sz; /* copied size into the user buffer */ } ring; }; #define FUSE_DEV_SYNC_INIT ((struct fuse_dev *) 1) #define FUSE_DEV_PTR_MASK (~1UL) static inline struct fuse_dev *__fuse_get_dev(struct file *file) { /* * Lockless access is OK, because file->private data is set * once during mount and is valid until the file is released. */ struct fuse_dev *fud = READ_ONCE(file->private_data); return (typeof(fud)) ((unsigned long) fud & FUSE_DEV_PTR_MASK); } struct fuse_dev *fuse_get_dev(struct file *file); unsigned int fuse_req_hash(u64 unique); struct fuse_req *fuse_request_find(struct fuse_pqueue *fpq, u64 unique); void fuse_dev_end_requests(struct list_head *head); void fuse_copy_init(struct fuse_copy_state *cs, bool write, struct iov_iter *iter); int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs, unsigned int argpages, struct fuse_arg *args, int zeroing); int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args, unsigned int nbytes); void fuse_dev_queue_forget(struct fuse_iqueue *fiq, struct fuse_forget_link *forget); void fuse_dev_queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req); bool fuse_remove_pending_req(struct fuse_req *req, spinlock_t *lock); bool fuse_request_expired(struct fuse_conn *fc, struct list_head *list); #endif |
| 7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 | /* SPDX-License-Identifier: GPL-2.0 */ /* * linux/can/dev.h * * Definitions for the CAN network device driver interface * * Copyright (C) 2006 Andrey Volkov <avolkov@varma-el.com> * Varma Electronics Oy * * Copyright (C) 2008 Wolfgang Grandegger <wg@grandegger.com> * */ #ifndef _CAN_DEV_H #define _CAN_DEV_H #include <linux/can.h> #include <linux/can/bittiming.h> #include <linux/can/error.h> #include <linux/can/length.h> #include <linux/can/netlink.h> #include <linux/can/skb.h> #include <linux/ethtool.h> #include <linux/netdevice.h> /* * CAN mode */ enum can_mode { CAN_MODE_STOP = 0, CAN_MODE_START, CAN_MODE_SLEEP }; enum can_termination_gpio { CAN_TERMINATION_GPIO_DISABLED = 0, CAN_TERMINATION_GPIO_ENABLED, CAN_TERMINATION_GPIO_MAX, }; /* * CAN common private data */ struct can_priv { struct net_device *dev; struct can_device_stats can_stats; const struct can_bittiming_const *bittiming_const; struct can_bittiming bittiming; struct data_bittiming_params fd; unsigned int bitrate_const_cnt; const u32 *bitrate_const; u32 bitrate_max; struct can_clock clock; unsigned int termination_const_cnt; const u16 *termination_const; u16 termination; struct gpio_desc *termination_gpio; u16 termination_gpio_ohms[CAN_TERMINATION_GPIO_MAX]; unsigned int echo_skb_max; struct sk_buff **echo_skb; enum can_state state; /* CAN controller features - see include/uapi/linux/can/netlink.h */ u32 ctrlmode; /* current options setting */ u32 ctrlmode_supported; /* options that can be modified by netlink */ int restart_ms; struct delayed_work restart_work; int (*do_set_bittiming)(struct net_device *dev); int (*do_set_mode)(struct net_device *dev, enum can_mode mode); int (*do_set_termination)(struct net_device *dev, u16 term); int (*do_get_state)(const struct net_device *dev, enum can_state *state); int (*do_get_berr_counter)(const struct net_device *dev, struct can_berr_counter *bec); }; static inline bool can_fd_tdc_is_enabled(const struct can_priv *priv) { return !!(priv->ctrlmode & CAN_CTRLMODE_FD_TDC_MASK); } static inline u32 can_get_static_ctrlmode(struct can_priv *priv) { return priv->ctrlmode & ~priv->ctrlmode_supported; } static inline bool can_is_canxl_dev_mtu(unsigned int mtu) { return (mtu >= CANXL_MIN_MTU && mtu <= CANXL_MAX_MTU); } /* drop skb if it does not contain a valid CAN frame for sending */ static inline bool can_dev_dropped_skb(struct net_device *dev, struct sk_buff *skb) { struct can_priv *priv = netdev_priv(dev); if (priv->ctrlmode & CAN_CTRLMODE_LISTENONLY) { netdev_info_once(dev, "interface in listen only mode, dropping skb\n"); kfree_skb(skb); dev->stats.tx_dropped++; return true; } return can_dropped_invalid_skb(dev, skb); } void can_setup(struct net_device *dev); struct net_device *alloc_candev_mqs(int sizeof_priv, unsigned int echo_skb_max, unsigned int txqs, unsigned int rxqs); #define alloc_candev(sizeof_priv, echo_skb_max) \ alloc_candev_mqs(sizeof_priv, echo_skb_max, 1, 1) #define alloc_candev_mq(sizeof_priv, echo_skb_max, count) \ alloc_candev_mqs(sizeof_priv, echo_skb_max, count, count) void free_candev(struct net_device *dev); /* a candev safe wrapper around netdev_priv */ struct can_priv *safe_candev_priv(struct net_device *dev); int open_candev(struct net_device *dev); void close_candev(struct net_device *dev); void can_set_default_mtu(struct net_device *dev); int can_change_mtu(struct net_device *dev, int new_mtu); int __must_check can_set_static_ctrlmode(struct net_device *dev, u32 static_mode); int can_eth_ioctl_hwts(struct net_device *netdev, struct ifreq *ifr, int cmd); int can_ethtool_op_get_ts_info_hwts(struct net_device *dev, struct kernel_ethtool_ts_info *info); int register_candev(struct net_device *dev); void unregister_candev(struct net_device *dev); int can_restart_now(struct net_device *dev); void can_bus_off(struct net_device *dev); const char *can_get_state_str(const enum can_state state); const char *can_get_ctrlmode_str(u32 ctrlmode); void can_state_get_by_berr_counter(const struct net_device *dev, const struct can_berr_counter *bec, enum can_state *tx_state, enum can_state *rx_state); void can_change_state(struct net_device *dev, struct can_frame *cf, enum can_state tx_state, enum can_state rx_state); #ifdef CONFIG_OF void of_can_transceiver(struct net_device *dev); #else static inline void of_can_transceiver(struct net_device *dev) { } #endif extern struct rtnl_link_ops can_link_ops; int can_netlink_register(void); void can_netlink_unregister(void); #endif /* !_CAN_DEV_H */ |
| 123 123 123 123 123 123 123 123 123 123 122 123 123 123 123 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 | // SPDX-License-Identifier: GPL-2.0 /* * linux/fs/ext4/block_validity.c * * Copyright (C) 2009 * Theodore Ts'o (tytso@mit.edu) * * Track which blocks in the filesystem are metadata blocks that * should never be used as data blocks by files or directories. */ #include <linux/time.h> #include <linux/fs.h> #include <linux/namei.h> #include <linux/quotaops.h> #include <linux/buffer_head.h> #include <linux/swap.h> #include <linux/pagemap.h> #include <linux/blkdev.h> #include <linux/slab.h> #include "ext4.h" struct ext4_system_zone { struct rb_node node; ext4_fsblk_t start_blk; unsigned int count; u32 ino; }; static struct kmem_cache *ext4_system_zone_cachep; int __init ext4_init_system_zone(void) { ext4_system_zone_cachep = KMEM_CACHE(ext4_system_zone, 0); if (ext4_system_zone_cachep == NULL) return -ENOMEM; return 0; } void ext4_exit_system_zone(void) { rcu_barrier(); kmem_cache_destroy(ext4_system_zone_cachep); } static inline int can_merge(struct ext4_system_zone *entry1, struct ext4_system_zone *entry2) { if ((entry1->start_blk + entry1->count) == entry2->start_blk && entry1->ino == entry2->ino) return 1; return 0; } static void release_system_zone(struct ext4_system_blocks *system_blks) { struct ext4_system_zone *entry, *n; rbtree_postorder_for_each_entry_safe(entry, n, &system_blks->root, node) kmem_cache_free(ext4_system_zone_cachep, entry); } /* * Mark a range of blocks as belonging to the "system zone" --- that * is, filesystem metadata blocks which should never be used by * inodes. */ static int add_system_zone(struct ext4_system_blocks *system_blks, ext4_fsblk_t start_blk, unsigned int count, u32 ino) { struct ext4_system_zone *new_entry, *entry; struct rb_node **n = &system_blks->root.rb_node, *node; struct rb_node *parent = NULL, *new_node; while (*n) { parent = *n; entry = rb_entry(parent, struct ext4_system_zone, node); if (start_blk < entry->start_blk) n = &(*n)->rb_left; else if (start_blk >= (entry->start_blk + entry->count)) n = &(*n)->rb_right; else /* Unexpected overlap of system zones. */ return -EFSCORRUPTED; } new_entry = kmem_cache_alloc(ext4_system_zone_cachep, GFP_KERNEL); if (!new_entry) return -ENOMEM; new_entry->start_blk = start_blk; new_entry->count = count; new_entry->ino = ino; new_node = &new_entry->node; rb_link_node(new_node, parent, n); rb_insert_color(new_node, &system_blks->root); /* Can we merge to the left? */ node = rb_prev(new_node); if (node) { entry = rb_entry(node, struct ext4_system_zone, node); if (can_merge(entry, new_entry)) { new_entry->start_blk = entry->start_blk; new_entry->count += entry->count; rb_erase(node, &system_blks->root); kmem_cache_free(ext4_system_zone_cachep, entry); } } /* Can we merge to the right? */ node = rb_next(new_node); if (node) { entry = rb_entry(node, struct ext4_system_zone, node); if (can_merge(new_entry, entry)) { new_entry->count += entry->count; rb_erase(node, &system_blks->root); kmem_cache_free(ext4_system_zone_cachep, entry); } } return 0; } static void debug_print_tree(struct ext4_sb_info *sbi) { struct rb_node *node; struct ext4_system_zone *entry; struct ext4_system_blocks *system_blks; int first = 1; printk(KERN_INFO "System zones: "); rcu_read_lock(); system_blks = rcu_dereference(sbi->s_system_blks); node = rb_first(&system_blks->root); while (node) { entry = rb_entry(node, struct ext4_system_zone, node); printk(KERN_CONT "%s%llu-%llu", first ? "" : ", ", entry->start_blk, entry->start_blk + entry->count - 1); first = 0; node = rb_next(node); } rcu_read_unlock(); printk(KERN_CONT "\n"); } static int ext4_protect_reserved_inode(struct super_block *sb, struct ext4_system_blocks *system_blks, u32 ino) { struct inode *inode; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_map_blocks map; u32 i = 0, num; int err = 0, n; if ((ino < EXT4_ROOT_INO) || (ino > le32_to_cpu(sbi->s_es->s_inodes_count))) return -EINVAL; inode = ext4_iget(sb, ino, EXT4_IGET_SPECIAL); if (IS_ERR(inode)) return PTR_ERR(inode); num = (inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits; while (i < num) { cond_resched(); map.m_lblk = i; map.m_len = num - i; n = ext4_map_blocks(NULL, inode, &map, 0); if (n < 0) { err = n; break; } if (n == 0) { i++; } else { err = add_system_zone(system_blks, map.m_pblk, n, ino); if (err < 0) { if (err == -EFSCORRUPTED) { EXT4_ERROR_INODE_ERR(inode, -err, "blocks %llu-%llu from inode overlap system zone", map.m_pblk, map.m_pblk + map.m_len - 1); } break; } i += n; } } iput(inode); return err; } static void ext4_destroy_system_zone(struct rcu_head *rcu) { struct ext4_system_blocks *system_blks; system_blks = container_of(rcu, struct ext4_system_blocks, rcu); release_system_zone(system_blks); kfree(system_blks); } /* * Build system zone rbtree which is used for block validity checking. * * The update of system_blks pointer in this function is protected by * sb->s_umount semaphore. However we have to be careful as we can be * racing with ext4_inode_block_valid() calls reading system_blks rbtree * protected only by RCU. That's why we first build the rbtree and then * swap it in place. */ int ext4_setup_system_zone(struct super_block *sb) { ext4_group_t ngroups = ext4_get_groups_count(sb); struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_system_blocks *system_blks; struct ext4_group_desc *gdp; ext4_group_t i; int ret; system_blks = kzalloc(sizeof(*system_blks), GFP_KERNEL); if (!system_blks) return -ENOMEM; for (i=0; i < ngroups; i++) { unsigned int meta_blks = ext4_num_base_meta_blocks(sb, i); cond_resched(); if (meta_blks != 0) { ret = add_system_zone(system_blks, ext4_group_first_block_no(sb, i), meta_blks, 0); if (ret) goto err; } gdp = ext4_get_group_desc(sb, i, NULL); ret = add_system_zone(system_blks, ext4_block_bitmap(sb, gdp), 1, 0); if (ret) goto err; ret = add_system_zone(system_blks, ext4_inode_bitmap(sb, gdp), 1, 0); if (ret) goto err; ret = add_system_zone(system_blks, ext4_inode_table(sb, gdp), sbi->s_itb_per_group, 0); if (ret) goto err; } if (ext4_has_feature_journal(sb) && sbi->s_es->s_journal_inum) { ret = ext4_protect_reserved_inode(sb, system_blks, le32_to_cpu(sbi->s_es->s_journal_inum)); if (ret) goto err; } /* * System blks rbtree complete, announce it once to prevent racing * with ext4_inode_block_valid() accessing the rbtree at the same * time. */ rcu_assign_pointer(sbi->s_system_blks, system_blks); if (test_opt(sb, DEBUG)) debug_print_tree(sbi); return 0; err: release_system_zone(system_blks); kfree(system_blks); return ret; } /* * Called when the filesystem is unmounted or when remounting it with * noblock_validity specified. * * The update of system_blks pointer in this function is protected by * sb->s_umount semaphore. However we have to be careful as we can be * racing with ext4_inode_block_valid() calls reading system_blks rbtree * protected only by RCU. So we first clear the system_blks pointer and * then free the rbtree only after RCU grace period expires. */ void ext4_release_system_zone(struct super_block *sb) { struct ext4_system_blocks *system_blks; system_blks = rcu_dereference_protected(EXT4_SB(sb)->s_system_blks, lockdep_is_held(&sb->s_umount)); rcu_assign_pointer(EXT4_SB(sb)->s_system_blks, NULL); if (system_blks) call_rcu(&system_blks->rcu, ext4_destroy_system_zone); } int ext4_sb_block_valid(struct super_block *sb, struct inode *inode, ext4_fsblk_t start_blk, unsigned int count) { struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_system_blocks *system_blks; struct ext4_system_zone *entry; struct rb_node *n; int ret = 1; if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) || (start_blk + count < start_blk) || (start_blk + count > ext4_blocks_count(sbi->s_es))) return 0; /* * Lock the system zone to prevent it being released concurrently * when doing a remount which inverse current "[no]block_validity" * mount option. */ rcu_read_lock(); system_blks = rcu_dereference(sbi->s_system_blks); if (system_blks == NULL) goto out_rcu; n = system_blks->root.rb_node; while (n) { entry = rb_entry(n, struct ext4_system_zone, node); if (start_blk + count - 1 < entry->start_blk) n = n->rb_left; else if (start_blk >= (entry->start_blk + entry->count)) n = n->rb_right; else { ret = 0; if (inode) ret = (entry->ino == inode->i_ino); break; } } out_rcu: rcu_read_unlock(); return ret; } /* * Returns 1 if the passed-in block region (start_blk, * start_blk+count) is valid; 0 if some part of the block region * overlaps with some other filesystem metadata blocks. */ int ext4_inode_block_valid(struct inode *inode, ext4_fsblk_t start_blk, unsigned int count) { return ext4_sb_block_valid(inode->i_sb, inode, start_blk, count); } int ext4_check_blockref(const char *function, unsigned int line, struct inode *inode, __le32 *p, unsigned int max) { __le32 *bref = p; unsigned int blk; journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; if (journal && inode == journal->j_inode) return 0; while (bref < p+max) { blk = le32_to_cpu(*bref++); if (blk && unlikely(!ext4_inode_block_valid(inode, blk, 1))) { ext4_error_inode(inode, function, line, blk, "invalid block"); return -EFSCORRUPTED; } } return 0; } |
| 1 1 1 1 1 4 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ #include "queueing.h" #include "timers.h" #include "device.h" #include "peer.h" #include "socket.h" #include "messages.h" #include "cookie.h" #include <linux/uio.h> #include <linux/inetdevice.h> #include <linux/socket.h> #include <net/ip_tunnels.h> #include <net/udp.h> #include <net/sock.h> static void wg_packet_send_handshake_initiation(struct wg_peer *peer) { struct message_handshake_initiation packet; if (!wg_birthdate_has_expired(atomic64_read(&peer->last_sent_handshake), REKEY_TIMEOUT)) return; /* This function is rate limited. */ atomic64_set(&peer->last_sent_handshake, ktime_get_coarse_boottime_ns()); net_dbg_ratelimited("%s: Sending handshake initiation to peer %llu (%pISpfsc)\n", peer->device->dev->name, peer->internal_id, &peer->endpoint.addr); if (wg_noise_handshake_create_initiation(&packet, &peer->handshake)) { wg_cookie_add_mac_to_packet(&packet, sizeof(packet), peer); wg_timers_any_authenticated_packet_traversal(peer); wg_timers_any_authenticated_packet_sent(peer); atomic64_set(&peer->last_sent_handshake, ktime_get_coarse_boottime_ns()); wg_socket_send_buffer_to_peer(peer, &packet, sizeof(packet), HANDSHAKE_DSCP); wg_timers_handshake_initiated(peer); } } void wg_packet_handshake_send_worker(struct work_struct *work) { struct wg_peer *peer = container_of(work, struct wg_peer, transmit_handshake_work); wg_packet_send_handshake_initiation(peer); wg_peer_put(peer); } void wg_packet_send_queued_handshake_initiation(struct wg_peer *peer, bool is_retry) { if (!is_retry) peer->timer_handshake_attempts = 0; rcu_read_lock_bh(); /* We check last_sent_handshake here in addition to the actual function * we're queueing up, so that we don't queue things if not strictly * necessary: */ if (!wg_birthdate_has_expired(atomic64_read(&peer->last_sent_handshake), REKEY_TIMEOUT) || unlikely(READ_ONCE(peer->is_dead))) goto out; wg_peer_get(peer); /* Queues up calling packet_send_queued_handshakes(peer), where we do a * peer_put(peer) after: */ if (!queue_work(peer->device->handshake_send_wq, &peer->transmit_handshake_work)) /* If the work was already queued, we want to drop the * extra reference: */ wg_peer_put(peer); out: rcu_read_unlock_bh(); } void wg_packet_send_handshake_response(struct wg_peer *peer) { struct message_handshake_response packet; atomic64_set(&peer->last_sent_handshake, ktime_get_coarse_boottime_ns()); net_dbg_ratelimited("%s: Sending handshake response to peer %llu (%pISpfsc)\n", peer->device->dev->name, peer->internal_id, &peer->endpoint.addr); if (wg_noise_handshake_create_response(&packet, &peer->handshake)) { wg_cookie_add_mac_to_packet(&packet, sizeof(packet), peer); if (wg_noise_handshake_begin_session(&peer->handshake, &peer->keypairs)) { wg_timers_session_derived(peer); wg_timers_any_authenticated_packet_traversal(peer); wg_timers_any_authenticated_packet_sent(peer); atomic64_set(&peer->last_sent_handshake, ktime_get_coarse_boottime_ns()); wg_socket_send_buffer_to_peer(peer, &packet, sizeof(packet), HANDSHAKE_DSCP); } } } void wg_packet_send_handshake_cookie(struct wg_device *wg, struct sk_buff *initiating_skb, __le32 sender_index) { struct message_handshake_cookie packet; net_dbg_skb_ratelimited("%s: Sending cookie response for denied handshake message for %pISpfsc\n", wg->dev->name, initiating_skb); wg_cookie_message_create(&packet, initiating_skb, sender_index, &wg->cookie_checker); wg_socket_send_buffer_as_reply_to_skb(wg, initiating_skb, &packet, sizeof(packet)); } static void keep_key_fresh(struct wg_peer *peer) { struct noise_keypair *keypair; bool send; rcu_read_lock_bh(); keypair = rcu_dereference_bh(peer->keypairs.current_keypair); send = keypair && READ_ONCE(keypair->sending.is_valid) && (atomic64_read(&keypair->sending_counter) > REKEY_AFTER_MESSAGES || (keypair->i_am_the_initiator && wg_birthdate_has_expired(keypair->sending.birthdate, REKEY_AFTER_TIME))); rcu_read_unlock_bh(); if (unlikely(send)) wg_packet_send_queued_handshake_initiation(peer, false); } static unsigned int calculate_skb_padding(struct sk_buff *skb) { unsigned int padded_size, last_unit = skb->len; if (unlikely(!PACKET_CB(skb)->mtu)) return ALIGN(last_unit, MESSAGE_PADDING_MULTIPLE) - last_unit; /* We do this modulo business with the MTU, just in case the networking * layer gives us a packet that's bigger than the MTU. In that case, we * wouldn't want the final subtraction to overflow in the case of the * padded_size being clamped. Fortunately, that's very rarely the case, * so we optimize for that not happening. */ if (unlikely(last_unit > PACKET_CB(skb)->mtu)) last_unit %= PACKET_CB(skb)->mtu; padded_size = min(PACKET_CB(skb)->mtu, ALIGN(last_unit, MESSAGE_PADDING_MULTIPLE)); return padded_size - last_unit; } static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair) { unsigned int padding_len, plaintext_len, trailer_len; struct scatterlist sg[MAX_SKB_FRAGS + 8]; struct message_data *header; struct sk_buff *trailer; int num_frags; /* Force hash calculation before encryption so that flow analysis is * consistent over the inner packet. */ skb_get_hash(skb); /* Calculate lengths. */ padding_len = calculate_skb_padding(skb); trailer_len = padding_len + noise_encrypted_len(0); plaintext_len = skb->len + padding_len; /* Expand data section to have room for padding and auth tag. */ num_frags = skb_cow_data(skb, trailer_len, &trailer); if (unlikely(num_frags < 0 || num_frags > ARRAY_SIZE(sg))) return false; /* Set the padding to zeros, and make sure it and the auth tag are part * of the skb. */ memset(skb_tail_pointer(trailer), 0, padding_len); /* Expand head section to have room for our header and the network * stack's headers. */ if (unlikely(skb_cow_head(skb, DATA_PACKET_HEAD_ROOM) < 0)) return false; /* Finalize checksum calculation for the inner packet, if required. */ if (unlikely(skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))) return false; /* Only after checksumming can we safely add on the padding at the end * and the header. */ skb_set_inner_network_header(skb, 0); header = (struct message_data *)skb_push(skb, sizeof(*header)); header->header.type = cpu_to_le32(MESSAGE_DATA); header->key_idx = keypair->remote_index; header->counter = cpu_to_le64(PACKET_CB(skb)->nonce); pskb_put(skb, trailer, trailer_len); /* Now we can encrypt the scattergather segments */ sg_init_table(sg, num_frags); if (skb_to_sgvec(skb, sg, sizeof(struct message_data), noise_encrypted_len(plaintext_len)) <= 0) return false; return chacha20poly1305_encrypt_sg_inplace(sg, plaintext_len, NULL, 0, PACKET_CB(skb)->nonce, keypair->sending.key); } void wg_packet_send_keepalive(struct wg_peer *peer) { struct sk_buff *skb; if (skb_queue_empty_lockless(&peer->staged_packet_queue)) { skb = alloc_skb(DATA_PACKET_HEAD_ROOM + MESSAGE_MINIMUM_LENGTH, GFP_ATOMIC); if (unlikely(!skb)) return; skb_reserve(skb, DATA_PACKET_HEAD_ROOM); skb->dev = peer->device->dev; PACKET_CB(skb)->mtu = skb->dev->mtu; skb_queue_tail(&peer->staged_packet_queue, skb); net_dbg_ratelimited("%s: Sending keepalive packet to peer %llu (%pISpfsc)\n", peer->device->dev->name, peer->internal_id, &peer->endpoint.addr); } wg_packet_send_staged_packets(peer); } static void wg_packet_create_data_done(struct wg_peer *peer, struct sk_buff *first) { struct sk_buff *skb, *next; bool is_keepalive, data_sent = false; wg_timers_any_authenticated_packet_traversal(peer); wg_timers_any_authenticated_packet_sent(peer); skb_list_walk_safe(first, skb, next) { is_keepalive = skb->len == message_data_len(0); if (likely(!wg_socket_send_skb_to_peer(peer, skb, PACKET_CB(skb)->ds) && !is_keepalive)) data_sent = true; } if (likely(data_sent)) wg_timers_data_sent(peer); keep_key_fresh(peer); } void wg_packet_tx_worker(struct work_struct *work) { struct wg_peer *peer = container_of(work, struct wg_peer, transmit_packet_work); struct noise_keypair *keypair; enum packet_state state; struct sk_buff *first; while ((first = wg_prev_queue_peek(&peer->tx_queue)) != NULL && (state = atomic_read_acquire(&PACKET_CB(first)->state)) != PACKET_STATE_UNCRYPTED) { wg_prev_queue_drop_peeked(&peer->tx_queue); keypair = PACKET_CB(first)->keypair; if (likely(state == PACKET_STATE_CRYPTED)) wg_packet_create_data_done(peer, first); else kfree_skb_list(first); wg_noise_keypair_put(keypair, false); wg_peer_put(peer); if (need_resched()) cond_resched(); } } void wg_packet_encrypt_worker(struct work_struct *work) { struct crypt_queue *queue = container_of(work, struct multicore_worker, work)->ptr; struct sk_buff *first, *skb, *next; while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) { enum packet_state state = PACKET_STATE_CRYPTED; skb_list_walk_safe(first, skb, next) { if (likely(encrypt_packet(skb, PACKET_CB(first)->keypair))) { wg_reset_packet(skb, true); } else { state = PACKET_STATE_DEAD; break; } } wg_queue_enqueue_per_peer_tx(first, state); if (need_resched()) cond_resched(); } } static void wg_packet_create_data(struct wg_peer *peer, struct sk_buff *first) { struct wg_device *wg = peer->device; int ret = -EINVAL; rcu_read_lock_bh(); if (unlikely(READ_ONCE(peer->is_dead))) goto err; ret = wg_queue_enqueue_per_device_and_peer(&wg->encrypt_queue, &peer->tx_queue, first, wg->packet_crypt_wq); if (unlikely(ret == -EPIPE)) wg_queue_enqueue_per_peer_tx(first, PACKET_STATE_DEAD); err: rcu_read_unlock_bh(); if (likely(!ret || ret == -EPIPE)) return; wg_noise_keypair_put(PACKET_CB(first)->keypair, false); wg_peer_put(peer); kfree_skb_list(first); } void wg_packet_purge_staged_packets(struct wg_peer *peer) { spin_lock_bh(&peer->staged_packet_queue.lock); DEV_STATS_ADD(peer->device->dev, tx_dropped, peer->staged_packet_queue.qlen); __skb_queue_purge(&peer->staged_packet_queue); spin_unlock_bh(&peer->staged_packet_queue.lock); } void wg_packet_send_staged_packets(struct wg_peer *peer) { struct noise_keypair *keypair; struct sk_buff_head packets; struct sk_buff *skb; /* Steal the current queue into our local one. */ __skb_queue_head_init(&packets); spin_lock_bh(&peer->staged_packet_queue.lock); skb_queue_splice_init(&peer->staged_packet_queue, &packets); spin_unlock_bh(&peer->staged_packet_queue.lock); if (unlikely(skb_queue_empty(&packets))) return; /* First we make sure we have a valid reference to a valid key. */ rcu_read_lock_bh(); keypair = wg_noise_keypair_get( rcu_dereference_bh(peer->keypairs.current_keypair)); rcu_read_unlock_bh(); if (unlikely(!keypair)) goto out_nokey; if (unlikely(!READ_ONCE(keypair->sending.is_valid))) goto out_nokey; if (unlikely(wg_birthdate_has_expired(keypair->sending.birthdate, REJECT_AFTER_TIME))) goto out_invalid; /* After we know we have a somewhat valid key, we now try to assign * nonces to all of the packets in the queue. If we can't assign nonces * for all of them, we just consider it a failure and wait for the next * handshake. */ skb_queue_walk(&packets, skb) { /* 0 for no outer TOS: no leak. TODO: at some later point, we * might consider using flowi->tos as outer instead. */ PACKET_CB(skb)->ds = ip_tunnel_ecn_encap(0, ip_hdr(skb), skb); PACKET_CB(skb)->nonce = atomic64_inc_return(&keypair->sending_counter) - 1; if (unlikely(PACKET_CB(skb)->nonce >= REJECT_AFTER_MESSAGES)) goto out_invalid; } packets.prev->next = NULL; wg_peer_get(keypair->entry.peer); PACKET_CB(packets.next)->keypair = keypair; wg_packet_create_data(peer, packets.next); return; out_invalid: WRITE_ONCE(keypair->sending.is_valid, false); out_nokey: wg_noise_keypair_put(keypair, false); /* We orphan the packets if we're waiting on a handshake, so that they * don't block a socket's pool. */ skb_queue_walk(&packets, skb) skb_orphan(skb); /* Then we put them back on the top of the queue. We're not too * concerned about accidentally getting things a little out of order if * packets are being added really fast, because this queue is for before * packets can even be sent and it's small anyway. */ spin_lock_bh(&peer->staged_packet_queue.lock); skb_queue_splice(&packets, &peer->staged_packet_queue); spin_unlock_bh(&peer->staged_packet_queue.lock); /* If we're exiting because there's something wrong with the key, it * means we should initiate a new handshake. */ wg_packet_send_queued_handshake_initiation(peer, false); } |
| 12 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * Public Key Signature Algorithm * * Copyright (c) 2023 Herbert Xu <herbert@gondor.apana.org.au> */ #ifndef _CRYPTO_INTERNAL_SIG_H #define _CRYPTO_INTERNAL_SIG_H #include <crypto/algapi.h> #include <crypto/sig.h> struct sig_instance { void (*free)(struct sig_instance *inst); union { struct { char head[offsetof(struct sig_alg, base)]; struct crypto_instance base; }; struct sig_alg alg; }; }; struct crypto_sig_spawn { struct crypto_spawn base; }; static inline void *crypto_sig_ctx(struct crypto_sig *tfm) { return crypto_tfm_ctx(&tfm->base); } /** * crypto_register_sig() -- Register public key signature algorithm * * Function registers an implementation of a public key signature algorithm * * @alg: algorithm definition * * Return: zero on success; error code in case of error */ int crypto_register_sig(struct sig_alg *alg); /** * crypto_unregister_sig() -- Unregister public key signature algorithm * * Function unregisters an implementation of a public key signature algorithm * * @alg: algorithm definition */ void crypto_unregister_sig(struct sig_alg *alg); int sig_register_instance(struct crypto_template *tmpl, struct sig_instance *inst); static inline struct sig_instance *sig_instance(struct crypto_instance *inst) { return container_of(&inst->alg, struct sig_instance, alg.base); } static inline struct sig_instance *sig_alg_instance(struct crypto_sig *tfm) { return sig_instance(crypto_tfm_alg_instance(&tfm->base)); } static inline struct crypto_instance *sig_crypto_instance(struct sig_instance *inst) { return container_of(&inst->alg.base, struct crypto_instance, alg); } static inline void *sig_instance_ctx(struct sig_instance *inst) { return crypto_instance_ctx(sig_crypto_instance(inst)); } int crypto_grab_sig(struct crypto_sig_spawn *spawn, struct crypto_instance *inst, const char *name, u32 type, u32 mask); static inline struct crypto_sig *crypto_spawn_sig(struct crypto_sig_spawn *spawn) { return crypto_spawn_tfm2(&spawn->base); } static inline void crypto_drop_sig(struct crypto_sig_spawn *spawn) { crypto_drop_spawn(&spawn->base); } static inline struct sig_alg *crypto_spawn_sig_alg(struct crypto_sig_spawn *spawn) { return container_of(spawn->base.alg, struct sig_alg, base); } #endif |
| 104 106 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 | // SPDX-License-Identifier: GPL-2.0-only /* * AppArmor security module * * This file contains AppArmor policy manipulation functions * * Copyright (C) 1998-2008 Novell/SUSE * Copyright 2009-2017 Canonical Ltd. * * AppArmor policy namespaces, allow for different sets of policies * to be loaded for tasks within the namespace. */ #include <linux/list.h> #include <linux/mutex.h> #include <linux/slab.h> #include <linux/string.h> #include "include/apparmor.h" #include "include/cred.h" #include "include/policy_ns.h" #include "include/label.h" #include "include/policy.h" /* kernel label */ struct aa_label *kernel_t; /* root profile namespace */ struct aa_ns *root_ns; const char *aa_hidden_ns_name = "---"; /** * aa_ns_visible - test if @view is visible from @curr * @curr: namespace to treat as the parent (NOT NULL) * @view: namespace to test if visible from @curr (NOT NULL) * @subns: whether view of a subns is allowed * * Returns: true if @view is visible from @curr else false */ bool aa_ns_visible(struct aa_ns *curr, struct aa_ns *view, bool subns) { if (curr == view) return true; if (!subns) return false; for ( ; view; view = view->parent) { if (view->parent == curr) return true; } return false; } /** * aa_ns_name - Find the ns name to display for @view from @curr * @curr: current namespace (NOT NULL) * @view: namespace attempting to view (NOT NULL) * @subns: are subns visible * * Returns: name of @view visible from @curr */ const char *aa_ns_name(struct aa_ns *curr, struct aa_ns *view, bool subns) { /* if view == curr then the namespace name isn't displayed */ if (curr == view) return ""; if (aa_ns_visible(curr, view, subns)) { /* at this point if a ns is visible it is in a view ns * thus the curr ns.hname is a prefix of its name. * Only output the virtualized portion of the name * Add + 2 to skip over // separating curr hname prefix * from the visible tail of the views hname */ return view->base.hname + strlen(curr->base.hname) + 2; } return aa_hidden_ns_name; } static struct aa_profile *alloc_unconfined(const char *name) { struct aa_profile *profile; profile = aa_alloc_null(NULL, name, GFP_KERNEL); if (!profile) return NULL; profile->label.flags |= FLAG_IX_ON_NAME_ERROR | FLAG_IMMUTIBLE | FLAG_NS_COUNT | FLAG_UNCONFINED; profile->mode = APPARMOR_UNCONFINED; return profile; } /** * alloc_ns - allocate, initialize and return a new namespace * @prefix: parent namespace name (MAYBE NULL) * @name: a preallocated name (NOT NULL) * * Returns: refcounted namespace or NULL on failure. */ static struct aa_ns *alloc_ns(const char *prefix, const char *name) { struct aa_ns *ns; ns = kzalloc(sizeof(*ns), GFP_KERNEL); AA_DEBUG(DEBUG_POLICY, "%s(%p)\n", __func__, ns); if (!ns) return NULL; if (!aa_policy_init(&ns->base, prefix, name, GFP_KERNEL)) goto fail_ns; INIT_LIST_HEAD(&ns->sub_ns); INIT_LIST_HEAD(&ns->rawdata_list); mutex_init(&ns->lock); init_waitqueue_head(&ns->wait); /* released by aa_free_ns() */ ns->unconfined = alloc_unconfined("unconfined"); if (!ns->unconfined) goto fail_unconfined; /* ns and ns->unconfined share ns->unconfined refcount */ ns->unconfined->ns = ns; atomic_set(&ns->uniq_null, 0); aa_labelset_init(&ns->labels); return ns; fail_unconfined: aa_policy_destroy(&ns->base); fail_ns: kfree_sensitive(ns); return NULL; } /** * aa_free_ns - free a profile namespace * @ns: the namespace to free (MAYBE NULL) * * Requires: All references to the namespace must have been put, if the * namespace was referenced by a profile confining a task, */ void aa_free_ns(struct aa_ns *ns) { if (!ns) return; aa_policy_destroy(&ns->base); aa_labelset_destroy(&ns->labels); aa_put_ns(ns->parent); ns->unconfined->ns = NULL; aa_free_profile(ns->unconfined); kfree_sensitive(ns); } /** * __aa_lookupn_ns - lookup the namespace matching @hname * @view: namespace to search in (NOT NULL) * @hname: hierarchical ns name (NOT NULL) * @n: length of @hname * * Requires: rcu_read_lock be held * * Returns: unrefcounted ns pointer or NULL if not found * * Do a relative name lookup, recursing through profile tree. */ struct aa_ns *__aa_lookupn_ns(struct aa_ns *view, const char *hname, size_t n) { struct aa_ns *ns = view; const char *split; for (split = strnstr(hname, "//", n); split; split = strnstr(hname, "//", n)) { ns = __aa_findn_ns(&ns->sub_ns, hname, split - hname); if (!ns) return NULL; n -= split + 2 - hname; hname = split + 2; } if (n) return __aa_findn_ns(&ns->sub_ns, hname, n); return NULL; } /** * aa_lookupn_ns - look up a policy namespace relative to @view * @view: namespace to search in (NOT NULL) * @name: name of namespace to find (NOT NULL) * @n: length of @name * * Returns: a refcounted namespace on the list, or NULL if no namespace * called @name exists. * * refcount released by caller */ struct aa_ns *aa_lookupn_ns(struct aa_ns *view, const char *name, size_t n) { struct aa_ns *ns = NULL; rcu_read_lock(); ns = aa_get_ns(__aa_lookupn_ns(view, name, n)); rcu_read_unlock(); return ns; } static struct aa_ns *__aa_create_ns(struct aa_ns *parent, const char *name, struct dentry *dir) { struct aa_ns *ns; int error; AA_BUG(!parent); AA_BUG(!name); AA_BUG(!mutex_is_locked(&parent->lock)); ns = alloc_ns(parent->base.hname, name); if (!ns) return ERR_PTR(-ENOMEM); ns->level = parent->level + 1; mutex_lock_nested(&ns->lock, ns->level); error = __aafs_ns_mkdir(ns, ns_subns_dir(parent), name, dir); if (error) { AA_ERROR("Failed to create interface for ns %s\n", ns->base.name); mutex_unlock(&ns->lock); aa_free_ns(ns); return ERR_PTR(error); } ns->parent = aa_get_ns(parent); list_add_rcu(&ns->base.list, &parent->sub_ns); /* add list ref */ aa_get_ns(ns); mutex_unlock(&ns->lock); return ns; } /** * __aa_find_or_create_ns - create an ns, fail if it already exists * @parent: the parent of the namespace being created * @name: the name of the namespace * @dir: if not null the dir to put the ns entries in * * Returns: the a refcounted ns that has been add or an ERR_PTR */ struct aa_ns *__aa_find_or_create_ns(struct aa_ns *parent, const char *name, struct dentry *dir) { struct aa_ns *ns; AA_BUG(!mutex_is_locked(&parent->lock)); /* try and find the specified ns */ /* released by caller */ ns = aa_get_ns(__aa_find_ns(&parent->sub_ns, name)); if (!ns) ns = __aa_create_ns(parent, name, dir); else ns = ERR_PTR(-EEXIST); /* return ref */ return ns; } /** * aa_prepare_ns - find an existing or create a new namespace of @name * @parent: ns to treat as parent * @name: the namespace to find or add (NOT NULL) * * Returns: refcounted namespace or PTR_ERR if failed to create one */ struct aa_ns *aa_prepare_ns(struct aa_ns *parent, const char *name) { struct aa_ns *ns; mutex_lock_nested(&parent->lock, parent->level); /* try and find the specified ns and if it doesn't exist create it */ /* released by caller */ ns = aa_get_ns(__aa_find_ns(&parent->sub_ns, name)); if (!ns) ns = __aa_create_ns(parent, name, NULL); mutex_unlock(&parent->lock); /* return ref */ return ns; } static void __ns_list_release(struct list_head *head); /** * destroy_ns - remove everything contained by @ns * @ns: namespace to have it contents removed (NOT NULL) */ static void destroy_ns(struct aa_ns *ns) { if (!ns) return; mutex_lock_nested(&ns->lock, ns->level); /* release all profiles in this namespace */ __aa_profile_list_release(&ns->base.profiles); /* release all sub namespaces */ __ns_list_release(&ns->sub_ns); if (ns->parent) { unsigned long flags; write_lock_irqsave(&ns->labels.lock, flags); __aa_proxy_redirect(ns_unconfined(ns), ns_unconfined(ns->parent)); write_unlock_irqrestore(&ns->labels.lock, flags); } __aafs_ns_rmdir(ns); mutex_unlock(&ns->lock); } /** * __aa_remove_ns - remove a namespace and all its children * @ns: namespace to be removed (NOT NULL) * * Requires: ns->parent->lock be held and ns removed from parent. */ void __aa_remove_ns(struct aa_ns *ns) { /* remove ns from namespace list */ list_del_rcu(&ns->base.list); destroy_ns(ns); aa_put_ns(ns); } /** * __ns_list_release - remove all profile namespaces on the list put refs * @head: list of profile namespaces (NOT NULL) * * Requires: namespace lock be held */ static void __ns_list_release(struct list_head *head) { struct aa_ns *ns, *tmp; list_for_each_entry_safe(ns, tmp, head, base.list) __aa_remove_ns(ns); } /** * aa_alloc_root_ns - allocate the root profile namespace * * Returns: %0 on success else error * */ int __init aa_alloc_root_ns(void) { struct aa_profile *kernel_p; /* released by aa_free_root_ns - used as list ref*/ root_ns = alloc_ns(NULL, "root"); if (!root_ns) return -ENOMEM; kernel_p = alloc_unconfined("kernel_t"); if (!kernel_p) { destroy_ns(root_ns); aa_free_ns(root_ns); return -ENOMEM; } kernel_t = &kernel_p->label; root_ns->unconfined->ns = aa_get_ns(root_ns); return 0; } /** * aa_free_root_ns - free the root profile namespace */ void __init aa_free_root_ns(void) { struct aa_ns *ns = root_ns; root_ns = NULL; aa_label_free(kernel_t); destroy_ns(ns); aa_put_ns(ns); } |
| 36 29 32 3 3 9 6 6 3 3 28 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * Landlock LSM - Ruleset management * * Copyright © 2016-2020 Mickaël Salaün <mic@digikod.net> * Copyright © 2018-2020 ANSSI */ #ifndef _SECURITY_LANDLOCK_RULESET_H #define _SECURITY_LANDLOCK_RULESET_H #include <linux/cleanup.h> #include <linux/err.h> #include <linux/mutex.h> #include <linux/rbtree.h> #include <linux/refcount.h> #include <linux/workqueue.h> #include "access.h" #include "limits.h" #include "object.h" struct landlock_hierarchy; /** * struct landlock_layer - Access rights for a given layer */ struct landlock_layer { /** * @level: Position of this layer in the layer stack. */ u16 level; /** * @access: Bitfield of allowed actions on the kernel object. They are * relative to the object type (e.g. %LANDLOCK_ACTION_FS_READ). */ access_mask_t access; }; /** * union landlock_key - Key of a ruleset's red-black tree */ union landlock_key { /** * @object: Pointer to identify a kernel object (e.g. an inode). */ struct landlock_object *object; /** * @data: Raw data to identify an arbitrary 32-bit value * (e.g. a TCP port). */ uintptr_t data; }; /** * enum landlock_key_type - Type of &union landlock_key */ enum landlock_key_type { /** * @LANDLOCK_KEY_INODE: Type of &landlock_ruleset.root_inode's node * keys. */ LANDLOCK_KEY_INODE = 1, /** * @LANDLOCK_KEY_NET_PORT: Type of &landlock_ruleset.root_net_port's * node keys. */ LANDLOCK_KEY_NET_PORT, }; /** * struct landlock_id - Unique rule identifier for a ruleset */ struct landlock_id { /** * @key: Identifies either a kernel object (e.g. an inode) or * a raw value (e.g. a TCP port). */ union landlock_key key; /** * @type: Type of a landlock_ruleset's root tree. */ const enum landlock_key_type type; }; /** * struct landlock_rule - Access rights tied to an object */ struct landlock_rule { /** * @node: Node in the ruleset's red-black tree. */ struct rb_node node; /** * @key: A union to identify either a kernel object (e.g. an inode) or * a raw data value (e.g. a network socket port). This is used as a key * for this ruleset element. The pointer is set once and never * modified. It always points to an allocated object because each rule * increments the refcount of its object. */ union landlock_key key; /** * @num_layers: Number of entries in @layers. */ u32 num_layers; /** * @layers: Stack of layers, from the latest to the newest, implemented * as a flexible array member (FAM). */ struct landlock_layer layers[] __counted_by(num_layers); }; /** * struct landlock_ruleset - Landlock ruleset * * This data structure must contain unique entries, be updatable, and quick to * match an object. */ struct landlock_ruleset { /** * @root_inode: Root of a red-black tree containing &struct * landlock_rule nodes with inode object. Once a ruleset is tied to a * process (i.e. as a domain), this tree is immutable until @usage * reaches zero. */ struct rb_root root_inode; #if IS_ENABLED(CONFIG_INET) /** * @root_net_port: Root of a red-black tree containing &struct * landlock_rule nodes with network port. Once a ruleset is tied to a * process (i.e. as a domain), this tree is immutable until @usage * reaches zero. */ struct rb_root root_net_port; #endif /* IS_ENABLED(CONFIG_INET) */ /** * @hierarchy: Enables hierarchy identification even when a parent * domain vanishes. This is needed for the ptrace protection. */ struct landlock_hierarchy *hierarchy; union { /** * @work_free: Enables to free a ruleset within a lockless * section. This is only used by * landlock_put_ruleset_deferred() when @usage reaches zero. * The fields @lock, @usage, @num_rules, @num_layers and * @access_masks are then unused. */ struct work_struct work_free; struct { /** * @lock: Protects against concurrent modifications of * @root, if @usage is greater than zero. */ struct mutex lock; /** * @usage: Number of processes (i.e. domains) or file * descriptors referencing this ruleset. */ refcount_t usage; /** * @num_rules: Number of non-overlapping (i.e. not for * the same object) rules in this ruleset. */ u32 num_rules; /** * @num_layers: Number of layers that are used in this * ruleset. This enables to check that all the layers * allow an access request. A value of 0 identifies a * non-merged ruleset (i.e. not a domain). */ u32 num_layers; /** * @access_masks: Contains the subset of filesystem and * network actions that are restricted by a ruleset. * A domain saves all layers of merged rulesets in a * stack (FAM), starting from the first layer to the * last one. These layers are used when merging * rulesets, for user space backward compatibility * (i.e. future-proof), and to properly handle merged * rulesets without overlapping access rights. These * layers are set once and never changed for the * lifetime of the ruleset. */ struct access_masks access_masks[]; }; }; }; struct landlock_ruleset * landlock_create_ruleset(const access_mask_t access_mask_fs, const access_mask_t access_mask_net, const access_mask_t scope_mask); void landlock_put_ruleset(struct landlock_ruleset *const ruleset); void landlock_put_ruleset_deferred(struct landlock_ruleset *const ruleset); DEFINE_FREE(landlock_put_ruleset, struct landlock_ruleset *, if (!IS_ERR_OR_NULL(_T)) landlock_put_ruleset(_T)) int landlock_insert_rule(struct landlock_ruleset *const ruleset, const struct landlock_id id, const access_mask_t access); struct landlock_ruleset * landlock_merge_ruleset(struct landlock_ruleset *const parent, struct landlock_ruleset *const ruleset); const struct landlock_rule * landlock_find_rule(const struct landlock_ruleset *const ruleset, const struct landlock_id id); static inline void landlock_get_ruleset(struct landlock_ruleset *const ruleset) { if (ruleset) refcount_inc(&ruleset->usage); } /** * landlock_union_access_masks - Return all access rights handled in the * domain * * @domain: Landlock ruleset (used as a domain) * * Returns: an access_masks result of the OR of all the domain's access masks. */ static inline struct access_masks landlock_union_access_masks(const struct landlock_ruleset *const domain) { union access_masks_all matches = {}; size_t layer_level; for (layer_level = 0; layer_level < domain->num_layers; layer_level++) { union access_masks_all layer = { .masks = domain->access_masks[layer_level], }; matches.all |= layer.all; } return matches.masks; } static inline void landlock_add_fs_access_mask(struct landlock_ruleset *const ruleset, const access_mask_t fs_access_mask, const u16 layer_level) { access_mask_t fs_mask = fs_access_mask & LANDLOCK_MASK_ACCESS_FS; /* Should already be checked in sys_landlock_create_ruleset(). */ WARN_ON_ONCE(fs_access_mask != fs_mask); ruleset->access_masks[layer_level].fs |= fs_mask; } static inline void landlock_add_net_access_mask(struct landlock_ruleset *const ruleset, const access_mask_t net_access_mask, const u16 layer_level) { access_mask_t net_mask = net_access_mask & LANDLOCK_MASK_ACCESS_NET; /* Should already be checked in sys_landlock_create_ruleset(). */ WARN_ON_ONCE(net_access_mask != net_mask); ruleset->access_masks[layer_level].net |= net_mask; } static inline void landlock_add_scope_mask(struct landlock_ruleset *const ruleset, const access_mask_t scope_mask, const u16 layer_level) { access_mask_t mask = scope_mask & LANDLOCK_MASK_SCOPE; /* Should already be checked in sys_landlock_create_ruleset(). */ WARN_ON_ONCE(scope_mask != mask); ruleset->access_masks[layer_level].scope |= mask; } static inline access_mask_t landlock_get_fs_access_mask(const struct landlock_ruleset *const ruleset, const u16 layer_level) { /* Handles all initially denied by default access rights. */ return ruleset->access_masks[layer_level].fs | _LANDLOCK_ACCESS_FS_INITIALLY_DENIED; } static inline access_mask_t landlock_get_net_access_mask(const struct landlock_ruleset *const ruleset, const u16 layer_level) { return ruleset->access_masks[layer_level].net; } static inline access_mask_t landlock_get_scope_mask(const struct landlock_ruleset *const ruleset, const u16 layer_level) { return ruleset->access_masks[layer_level].scope; } bool landlock_unmask_layers(const struct landlock_rule *const rule, const access_mask_t access_request, layer_mask_t (*const layer_masks)[], const size_t masks_array_size); access_mask_t landlock_init_layer_masks(const struct landlock_ruleset *const domain, const access_mask_t access_request, layer_mask_t (*const layer_masks)[], const enum landlock_key_type key_type); #endif /* _SECURITY_LANDLOCK_RULESET_H */ |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (c) 2018 Oracle. All rights reserved. * * Trace point definitions for the "rpcgss" subsystem. */ #undef TRACE_SYSTEM #define TRACE_SYSTEM rpcgss #if !defined(_TRACE_RPCGSS_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_RPCGSS_H #include <linux/tracepoint.h> #include <trace/misc/sunrpc.h> /** ** GSS-API related trace events **/ TRACE_DEFINE_ENUM(RPC_GSS_SVC_NONE); TRACE_DEFINE_ENUM(RPC_GSS_SVC_INTEGRITY); TRACE_DEFINE_ENUM(RPC_GSS_SVC_PRIVACY); #define show_gss_service(x) \ __print_symbolic(x, \ { RPC_GSS_SVC_NONE, "none" }, \ { RPC_GSS_SVC_INTEGRITY, "integrity" }, \ { RPC_GSS_SVC_PRIVACY, "privacy" }) TRACE_DEFINE_ENUM(GSS_S_BAD_MECH); TRACE_DEFINE_ENUM(GSS_S_BAD_NAME); TRACE_DEFINE_ENUM(GSS_S_BAD_NAMETYPE); TRACE_DEFINE_ENUM(GSS_S_BAD_BINDINGS); TRACE_DEFINE_ENUM(GSS_S_BAD_STATUS); TRACE_DEFINE_ENUM(GSS_S_BAD_SIG); TRACE_DEFINE_ENUM(GSS_S_NO_CRED); TRACE_DEFINE_ENUM(GSS_S_NO_CONTEXT); TRACE_DEFINE_ENUM(GSS_S_DEFECTIVE_TOKEN); TRACE_DEFINE_ENUM(GSS_S_DEFECTIVE_CREDENTIAL); TRACE_DEFINE_ENUM(GSS_S_CREDENTIALS_EXPIRED); TRACE_DEFINE_ENUM(GSS_S_CONTEXT_EXPIRED); TRACE_DEFINE_ENUM(GSS_S_FAILURE); TRACE_DEFINE_ENUM(GSS_S_BAD_QOP); TRACE_DEFINE_ENUM(GSS_S_UNAUTHORIZED); TRACE_DEFINE_ENUM(GSS_S_UNAVAILABLE); TRACE_DEFINE_ENUM(GSS_S_DUPLICATE_ELEMENT); TRACE_DEFINE_ENUM(GSS_S_NAME_NOT_MN); TRACE_DEFINE_ENUM(GSS_S_CONTINUE_NEEDED); TRACE_DEFINE_ENUM(GSS_S_DUPLICATE_TOKEN); TRACE_DEFINE_ENUM(GSS_S_OLD_TOKEN); TRACE_DEFINE_ENUM(GSS_S_UNSEQ_TOKEN); TRACE_DEFINE_ENUM(GSS_S_GAP_TOKEN); #define show_gss_status(x) \ __print_symbolic(x, \ { GSS_S_BAD_MECH, "GSS_S_BAD_MECH" }, \ { GSS_S_BAD_NAME, "GSS_S_BAD_NAME" }, \ { GSS_S_BAD_NAMETYPE, "GSS_S_BAD_NAMETYPE" }, \ { GSS_S_BAD_BINDINGS, "GSS_S_BAD_BINDINGS" }, \ { GSS_S_BAD_STATUS, "GSS_S_BAD_STATUS" }, \ { GSS_S_BAD_SIG, "GSS_S_BAD_SIG" }, \ { GSS_S_NO_CRED, "GSS_S_NO_CRED" }, \ { GSS_S_NO_CONTEXT, "GSS_S_NO_CONTEXT" }, \ { GSS_S_DEFECTIVE_TOKEN, "GSS_S_DEFECTIVE_TOKEN" }, \ { GSS_S_DEFECTIVE_CREDENTIAL, "GSS_S_DEFECTIVE_CREDENTIAL" }, \ { GSS_S_CREDENTIALS_EXPIRED, "GSS_S_CREDENTIALS_EXPIRED" }, \ { GSS_S_CONTEXT_EXPIRED, "GSS_S_CONTEXT_EXPIRED" }, \ { GSS_S_FAILURE, "GSS_S_FAILURE" }, \ { GSS_S_BAD_QOP, "GSS_S_BAD_QOP" }, \ { GSS_S_UNAUTHORIZED, "GSS_S_UNAUTHORIZED" }, \ { GSS_S_UNAVAILABLE, "GSS_S_UNAVAILABLE" }, \ { GSS_S_DUPLICATE_ELEMENT, "GSS_S_DUPLICATE_ELEMENT" }, \ { GSS_S_NAME_NOT_MN, "GSS_S_NAME_NOT_MN" }, \ { GSS_S_CONTINUE_NEEDED, "GSS_S_CONTINUE_NEEDED" }, \ { GSS_S_DUPLICATE_TOKEN, "GSS_S_DUPLICATE_TOKEN" }, \ { GSS_S_OLD_TOKEN, "GSS_S_OLD_TOKEN" }, \ { GSS_S_UNSEQ_TOKEN, "GSS_S_UNSEQ_TOKEN" }, \ { GSS_S_GAP_TOKEN, "GSS_S_GAP_TOKEN" }) DECLARE_EVENT_CLASS(rpcgss_gssapi_event, TP_PROTO( const struct rpc_task *task, u32 maj_stat ), TP_ARGS(task, maj_stat), TP_STRUCT__entry( __field(unsigned int, task_id) __field(unsigned int, client_id) __field(u32, maj_stat) ), TP_fast_assign( __entry->task_id = task->tk_pid; __entry->client_id = task->tk_client->cl_clid; __entry->maj_stat = maj_stat; ), TP_printk(SUNRPC_TRACE_TASK_SPECIFIER " maj_stat=%s", __entry->task_id, __entry->client_id, __entry->maj_stat == 0 ? "GSS_S_COMPLETE" : show_gss_status(__entry->maj_stat)) ); #define DEFINE_GSSAPI_EVENT(name) \ DEFINE_EVENT(rpcgss_gssapi_event, rpcgss_##name, \ TP_PROTO( \ const struct rpc_task *task, \ u32 maj_stat \ ), \ TP_ARGS(task, maj_stat)) TRACE_EVENT(rpcgss_import_ctx, TP_PROTO( int status ), TP_ARGS(status), TP_STRUCT__entry( __field(int, status) ), TP_fast_assign( __entry->status = status; ), TP_printk("status=%d", __entry->status) ); DEFINE_GSSAPI_EVENT(get_mic); DEFINE_GSSAPI_EVENT(verify_mic); DEFINE_GSSAPI_EVENT(wrap); DEFINE_GSSAPI_EVENT(unwrap); DECLARE_EVENT_CLASS(rpcgss_ctx_class, TP_PROTO( const struct gss_cred *gc ), TP_ARGS(gc), TP_STRUCT__entry( __field(const void *, cred) __field(unsigned long, service) __string(principal, gc->gc_principal) ), TP_fast_assign( __entry->cred = gc; __entry->service = gc->gc_service; __assign_str(principal); ), TP_printk("cred=%p service=%s principal='%s'", __entry->cred, show_gss_service(__entry->service), __get_str(principal)) ); #define DEFINE_CTX_EVENT(name) \ DEFINE_EVENT(rpcgss_ctx_class, rpcgss_ctx_##name, \ TP_PROTO( \ const struct gss_cred *gc \ ), \ TP_ARGS(gc)) DEFINE_CTX_EVENT(init); DEFINE_CTX_EVENT(destroy); DECLARE_EVENT_CLASS(rpcgss_svc_gssapi_class, TP_PROTO( const struct svc_rqst *rqstp, u32 maj_stat ), TP_ARGS(rqstp, maj_stat), TP_STRUCT__entry( __field(u32, xid) __field(u32, maj_stat) __string(addr, rqstp->rq_xprt->xpt_remotebuf) ), TP_fast_assign( __entry->xid = __be32_to_cpu(rqstp->rq_xid); __entry->maj_stat = maj_stat; __assign_str(addr); ), TP_printk("addr=%s xid=0x%08x maj_stat=%s", __get_str(addr), __entry->xid, __entry->maj_stat == 0 ? "GSS_S_COMPLETE" : show_gss_status(__entry->maj_stat)) ); #define DEFINE_SVC_GSSAPI_EVENT(name) \ DEFINE_EVENT(rpcgss_svc_gssapi_class, rpcgss_svc_##name, \ TP_PROTO( \ const struct svc_rqst *rqstp, \ u32 maj_stat \ ), \ TP_ARGS(rqstp, maj_stat)) DEFINE_SVC_GSSAPI_EVENT(wrap); DEFINE_SVC_GSSAPI_EVENT(unwrap); DEFINE_SVC_GSSAPI_EVENT(mic); DEFINE_SVC_GSSAPI_EVENT(get_mic); TRACE_EVENT(rpcgss_svc_wrap_failed, TP_PROTO( const struct svc_rqst *rqstp ), TP_ARGS(rqstp), TP_STRUCT__entry( __field(u32, xid) __string(addr, rqstp->rq_xprt->xpt_remotebuf) ), TP_fast_assign( __entry->xid = be32_to_cpu(rqstp->rq_xid); __assign_str(addr); ), TP_printk("addr=%s xid=0x%08x", __get_str(addr), __entry->xid) ); TRACE_EVENT(rpcgss_svc_unwrap_failed, TP_PROTO( const struct svc_rqst *rqstp ), TP_ARGS(rqstp), TP_STRUCT__entry( __field(u32, xid) __string(addr, rqstp->rq_xprt->xpt_remotebuf) ), TP_fast_assign( __entry->xid = be32_to_cpu(rqstp->rq_xid); __assign_str(addr); ), TP_printk("addr=%s xid=0x%08x", __get_str(addr), __entry->xid) ); TRACE_EVENT(rpcgss_svc_seqno_bad, TP_PROTO( const struct svc_rqst *rqstp, u32 expected, u32 received ), TP_ARGS(rqstp, expected, received), TP_STRUCT__entry( __field(u32, expected) __field(u32, received) __field(u32, xid) __string(addr, rqstp->rq_xprt->xpt_remotebuf) ), TP_fast_assign( __entry->expected = expected; __entry->received = received; __entry->xid = __be32_to_cpu(rqstp->rq_xid); __assign_str(addr); ), TP_printk("addr=%s xid=0x%08x expected seqno %u, received seqno %u", __get_str(addr), __entry->xid, __entry->expected, __entry->received) ); TRACE_EVENT(rpcgss_svc_accept_upcall, TP_PROTO( const struct svc_rqst *rqstp, u32 major_status, u32 minor_status ), TP_ARGS(rqstp, major_status, minor_status), TP_STRUCT__entry( __field(u32, minor_status) __field(unsigned long, major_status) __field(u32, xid) __string(addr, rqstp->rq_xprt->xpt_remotebuf) ), TP_fast_assign( __entry->minor_status = minor_status; __entry->major_status = major_status; __entry->xid = be32_to_cpu(rqstp->rq_xid); __assign_str(addr); ), TP_printk("addr=%s xid=0x%08x major_status=%s (0x%08lx) minor_status=%u", __get_str(addr), __entry->xid, (__entry->major_status == 0) ? "GSS_S_COMPLETE" : show_gss_status(__entry->major_status), __entry->major_status, __entry->minor_status ) ); TRACE_EVENT(rpcgss_svc_authenticate, TP_PROTO( const struct svc_rqst *rqstp, const struct rpc_gss_wire_cred *gc ), TP_ARGS(rqstp, gc), TP_STRUCT__entry( __field(u32, seqno) __field(u32, xid) __string(addr, rqstp->rq_xprt->xpt_remotebuf) ), TP_fast_assign( __entry->xid = be32_to_cpu(rqstp->rq_xid); __entry->seqno = gc->gc_seq; __assign_str(addr); ), TP_printk("addr=%s xid=0x%08x seqno=%u", __get_str(addr), __entry->xid, __entry->seqno) ); /** ** GSS auth unwrap failures **/ TRACE_EVENT(rpcgss_unwrap_failed, TP_PROTO( const struct rpc_task *task ), TP_ARGS(task), TP_STRUCT__entry( __field(unsigned int, task_id) __field(unsigned int, client_id) ), TP_fast_assign( __entry->task_id = task->tk_pid; __entry->client_id = task->tk_client->cl_clid; ), TP_printk(SUNRPC_TRACE_TASK_SPECIFIER, __entry->task_id, __entry->client_id) ); TRACE_EVENT(rpcgss_bad_seqno, TP_PROTO( const struct rpc_task *task, u32 expected, u32 received ), TP_ARGS(task, expected, received), TP_STRUCT__entry( __field(unsigned int, task_id) __field(unsigned int, client_id) __field(u32, expected) __field(u32, received) ), TP_fast_assign( __entry->task_id = task->tk_pid; __entry->client_id = task->tk_client->cl_clid; __entry->expected = expected; __entry->received = received; ), TP_printk(SUNRPC_TRACE_TASK_SPECIFIER " expected seqno %u, received seqno %u", __entry->task_id, __entry->client_id, __entry->expected, __entry->received) ); TRACE_EVENT(rpcgss_seqno, TP_PROTO( const struct rpc_task *task ), TP_ARGS(task), TP_STRUCT__entry( __field(unsigned int, task_id) __field(unsigned int, client_id) __field(u32, xid) __field(u32, seqno) ), TP_fast_assign( const struct rpc_rqst *rqst = task->tk_rqstp; __entry->task_id = task->tk_pid; __entry->client_id = task->tk_client->cl_clid; __entry->xid = be32_to_cpu(rqst->rq_xid); __entry->seqno = *rqst->rq_seqnos; ), TP_printk(SUNRPC_TRACE_TASK_SPECIFIER " xid=0x%08x seqno=%u", __entry->task_id, __entry->client_id, __entry->xid, __entry->seqno) ); TRACE_EVENT(rpcgss_need_reencode, TP_PROTO( const struct rpc_task *task, u32 seq_xmit, bool ret ), TP_ARGS(task, seq_xmit, ret), TP_STRUCT__entry( __field(unsigned int, task_id) __field(unsigned int, client_id) __field(u32, xid) __field(u32, seq_xmit) __field(u32, seqno) __field(bool, ret) ), TP_fast_assign( __entry->task_id = task->tk_pid; __entry->client_id = task->tk_client->cl_clid; __entry->xid = be32_to_cpu(task->tk_rqstp->rq_xid); __entry->seq_xmit = seq_xmit; __entry->seqno = *task->tk_rqstp->rq_seqnos; __entry->ret = ret; ), TP_printk(SUNRPC_TRACE_TASK_SPECIFIER " xid=0x%08x rq_seqno=%u seq_xmit=%u reencode %sneeded", __entry->task_id, __entry->client_id, __entry->xid, __entry->seqno, __entry->seq_xmit, __entry->ret ? "" : "un") ); TRACE_EVENT(rpcgss_update_slack, TP_PROTO( const struct rpc_task *task, const struct rpc_auth *auth ), TP_ARGS(task, auth), TP_STRUCT__entry( __field(unsigned int, task_id) __field(unsigned int, client_id) __field(u32, xid) __field(const void *, auth) __field(unsigned int, rslack) __field(unsigned int, ralign) __field(unsigned int, verfsize) ), TP_fast_assign( __entry->task_id = task->tk_pid; __entry->client_id = task->tk_client->cl_clid; __entry->xid = be32_to_cpu(task->tk_rqstp->rq_xid); __entry->auth = auth; __entry->rslack = auth->au_rslack; __entry->ralign = auth->au_ralign; __entry->verfsize = auth->au_verfsize; ), TP_printk(SUNRPC_TRACE_TASK_SPECIFIER " xid=0x%08x auth=%p rslack=%u ralign=%u verfsize=%u\n", __entry->task_id, __entry->client_id, __entry->xid, __entry->auth, __entry->rslack, __entry->ralign, __entry->verfsize) ); DECLARE_EVENT_CLASS(rpcgss_svc_seqno_class, TP_PROTO( const struct svc_rqst *rqstp, u32 seqno ), TP_ARGS(rqstp, seqno), TP_STRUCT__entry( __field(u32, xid) __field(u32, seqno) ), TP_fast_assign( __entry->xid = be32_to_cpu(rqstp->rq_xid); __entry->seqno = seqno; ), TP_printk("xid=0x%08x seqno=%u", __entry->xid, __entry->seqno) ); #define DEFINE_SVC_SEQNO_EVENT(name) \ DEFINE_EVENT(rpcgss_svc_seqno_class, rpcgss_svc_seqno_##name, \ TP_PROTO( \ const struct svc_rqst *rqstp, \ u32 seqno \ ), \ TP_ARGS(rqstp, seqno)) DEFINE_SVC_SEQNO_EVENT(large); DEFINE_SVC_SEQNO_EVENT(seen); TRACE_EVENT(rpcgss_svc_seqno_low, TP_PROTO( const struct svc_rqst *rqstp, u32 seqno, u32 min, u32 max ), TP_ARGS(rqstp, seqno, min, max), TP_STRUCT__entry( __field(u32, xid) __field(u32, seqno) __field(u32, min) __field(u32, max) ), TP_fast_assign( __entry->xid = be32_to_cpu(rqstp->rq_xid); __entry->seqno = seqno; __entry->min = min; __entry->max = max; ), TP_printk("xid=0x%08x seqno=%u window=[%u..%u]", __entry->xid, __entry->seqno, __entry->min, __entry->max) ); /** ** gssd upcall related trace events **/ TRACE_EVENT(rpcgss_upcall_msg, TP_PROTO( const char *buf ), TP_ARGS(buf), TP_STRUCT__entry( __string(msg, buf) ), TP_fast_assign( __assign_str(msg); ), TP_printk("msg='%s'", __get_str(msg)) ); TRACE_EVENT(rpcgss_upcall_result, TP_PROTO( u32 uid, int result ), TP_ARGS(uid, result), TP_STRUCT__entry( __field(u32, uid) __field(int, result) ), TP_fast_assign( __entry->uid = uid; __entry->result = result; ), TP_printk("for uid %u, result=%d", __entry->uid, __entry->result) ); TRACE_EVENT(rpcgss_context, TP_PROTO( u32 window_size, unsigned long expiry, unsigned long now, unsigned int timeout, unsigned int len, const u8 *data ), TP_ARGS(window_size, expiry, now, timeout, len, data), TP_STRUCT__entry( __field(unsigned long, expiry) __field(unsigned long, now) __field(unsigned int, timeout) __field(u32, window_size) __field(int, len) __string_len(acceptor, data, len) ), TP_fast_assign( __entry->expiry = expiry; __entry->now = now; __entry->timeout = timeout; __entry->window_size = window_size; __entry->len = len; __assign_str(acceptor); ), TP_printk("win_size=%u expiry=%lu now=%lu timeout=%u acceptor=%.*s", __entry->window_size, __entry->expiry, __entry->now, __entry->timeout, __entry->len, __get_str(acceptor)) ); /** ** Miscellaneous events */ TRACE_DEFINE_ENUM(RPC_AUTH_GSS_KRB5); TRACE_DEFINE_ENUM(RPC_AUTH_GSS_KRB5I); TRACE_DEFINE_ENUM(RPC_AUTH_GSS_KRB5P); #define show_pseudoflavor(x) \ __print_symbolic(x, \ { RPC_AUTH_GSS_KRB5, "RPC_AUTH_GSS_KRB5" }, \ { RPC_AUTH_GSS_KRB5I, "RPC_AUTH_GSS_KRB5I" }, \ { RPC_AUTH_GSS_KRB5P, "RPC_AUTH_GSS_KRB5P" }) TRACE_EVENT(rpcgss_createauth, TP_PROTO( unsigned int flavor, int error ), TP_ARGS(flavor, error), TP_STRUCT__entry( __field(unsigned int, flavor) __field(int, error) ), TP_fast_assign( __entry->flavor = flavor; __entry->error = error; ), TP_printk("flavor=%s error=%d", show_pseudoflavor(__entry->flavor), __entry->error) ); TRACE_EVENT(rpcgss_oid_to_mech, TP_PROTO( const char *oid ), TP_ARGS(oid), TP_STRUCT__entry( __string(oid, oid) ), TP_fast_assign( __assign_str(oid); ), TP_printk("mech for oid %s was not found", __get_str(oid)) ); #endif /* _TRACE_RPCGSS_H */ #include <trace/define_trace.h> |
| 2 2 2 2 2 2 2 2 2 1 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Scatterlist Cryptographic API. * * Procfs information. * * Copyright (c) 2002 James Morris <jmorris@intercode.com.au> * Copyright (c) 2005 Herbert Xu <herbert@gondor.apana.org.au> */ #include <linux/atomic.h> #include <linux/init.h> #include <linux/crypto.h> #include <linux/fips.h> #include <linux/module.h> /* for module_name() */ #include <linux/rwsem.h> #include <linux/proc_fs.h> #include <linux/seq_file.h> #include "internal.h" static void *c_start(struct seq_file *m, loff_t *pos) { down_read(&crypto_alg_sem); return seq_list_start(&crypto_alg_list, *pos); } static void *c_next(struct seq_file *m, void *p, loff_t *pos) { return seq_list_next(p, &crypto_alg_list, pos); } static void c_stop(struct seq_file *m, void *p) { up_read(&crypto_alg_sem); } static int c_show(struct seq_file *m, void *p) { struct crypto_alg *alg = list_entry(p, struct crypto_alg, cra_list); seq_printf(m, "name : %s\n", alg->cra_name); seq_printf(m, "driver : %s\n", alg->cra_driver_name); seq_printf(m, "module : %s\n", module_name(alg->cra_module)); seq_printf(m, "priority : %d\n", alg->cra_priority); seq_printf(m, "refcnt : %u\n", refcount_read(&alg->cra_refcnt)); seq_printf(m, "selftest : %s\n", (alg->cra_flags & CRYPTO_ALG_TESTED) ? "passed" : "unknown"); seq_printf(m, "internal : %s\n", str_yes_no(alg->cra_flags & CRYPTO_ALG_INTERNAL)); if (fips_enabled) seq_printf(m, "fips : %s\n", str_no_yes(alg->cra_flags & CRYPTO_ALG_FIPS_INTERNAL)); if (alg->cra_flags & CRYPTO_ALG_LARVAL) { seq_printf(m, "type : larval\n"); seq_printf(m, "flags : 0x%x\n", alg->cra_flags); goto out; } if (alg->cra_type && alg->cra_type->show) { alg->cra_type->show(m, alg); goto out; } switch (alg->cra_flags & CRYPTO_ALG_TYPE_MASK) { case CRYPTO_ALG_TYPE_CIPHER: seq_printf(m, "type : cipher\n"); seq_printf(m, "blocksize : %u\n", alg->cra_blocksize); seq_printf(m, "min keysize : %u\n", alg->cra_cipher.cia_min_keysize); seq_printf(m, "max keysize : %u\n", alg->cra_cipher.cia_max_keysize); break; default: seq_printf(m, "type : unknown\n"); break; } out: seq_putc(m, '\n'); return 0; } static const struct seq_operations crypto_seq_ops = { .start = c_start, .next = c_next, .stop = c_stop, .show = c_show }; void __init crypto_init_proc(void) { proc_create_seq("crypto", 0, NULL, &crypto_seq_ops); } void __exit crypto_exit_proc(void) { remove_proc_entry("crypto", NULL); } |
| 89 1001 5055 7364 167 12943 10042 3 10042 242 777 961 9 2974 589 1 10218 5984 956 8759 4358 947 1407 3872 7845 12923 11 167 4729 189 10199 253 10169 10006 94 4817 4817 4026 9597 959 2195 303 1076 2089 2258 2257 153 153 142 94 14 94 774 881 3666 1737 1642 1623 1623 1620 3693 1603 170 154 1085 1083 1062 168 169 154 149 137 138 137 165 7 3 7 2 1 138 218 190 216 215 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 | /* SPDX-License-Identifier: GPL-2.0+ */ #ifndef _LINUX_XARRAY_H #define _LINUX_XARRAY_H /* * eXtensible Arrays * Copyright (c) 2017 Microsoft Corporation * Author: Matthew Wilcox <willy@infradead.org> * * See Documentation/core-api/xarray.rst for how to use the XArray. */ #include <linux/bitmap.h> #include <linux/bug.h> #include <linux/compiler.h> #include <linux/err.h> #include <linux/gfp.h> #include <linux/kconfig.h> #include <linux/limits.h> #include <linux/lockdep.h> #include <linux/rcupdate.h> #include <linux/sched/mm.h> #include <linux/spinlock.h> #include <linux/types.h> struct list_lru; /* * The bottom two bits of the entry determine how the XArray interprets * the contents: * * 00: Pointer entry * 10: Internal entry * x1: Value entry or tagged pointer * * Attempting to store internal entries in the XArray is a bug. * * Most internal entries are pointers to the next node in the tree. * The following internal entries have a special meaning: * * 0-62: Sibling entries * 256: Retry entry * 257: Zero entry * * Errors are also represented as internal entries, but use the negative * space (-4094 to -2). They're never stored in the slots array; only * returned by the normal API. */ #define BITS_PER_XA_VALUE (BITS_PER_LONG - 1) /** * xa_mk_value() - Create an XArray entry from an integer. * @v: Value to store in XArray. * * Context: Any context. * Return: An entry suitable for storing in the XArray. */ static inline void *xa_mk_value(unsigned long v) { WARN_ON((long)v < 0); return (void *)((v << 1) | 1); } /** * xa_to_value() - Get value stored in an XArray entry. * @entry: XArray entry. * * Context: Any context. * Return: The value stored in the XArray entry. */ static inline unsigned long xa_to_value(const void *entry) { return (unsigned long)entry >> 1; } /** * xa_is_value() - Determine if an entry is a value. * @entry: XArray entry. * * Context: Any context. * Return: True if the entry is a value, false if it is a pointer. */ static inline bool xa_is_value(const void *entry) { return (unsigned long)entry & 1; } /** * xa_tag_pointer() - Create an XArray entry for a tagged pointer. * @p: Plain pointer. * @tag: Tag value (0, 1 or 3). * * If the user of the XArray prefers, they can tag their pointers instead * of storing value entries. Three tags are available (0, 1 and 3). * These are distinct from the xa_mark_t as they are not replicated up * through the array and cannot be searched for. * * Context: Any context. * Return: An XArray entry. */ static inline void *xa_tag_pointer(void *p, unsigned long tag) { return (void *)((unsigned long)p | tag); } /** * xa_untag_pointer() - Turn an XArray entry into a plain pointer. * @entry: XArray entry. * * If you have stored a tagged pointer in the XArray, call this function * to get the untagged version of the pointer. * * Context: Any context. * Return: A pointer. */ static inline void *xa_untag_pointer(void *entry) { return (void *)((unsigned long)entry & ~3UL); } /** * xa_pointer_tag() - Get the tag stored in an XArray entry. * @entry: XArray entry. * * If you have stored a tagged pointer in the XArray, call this function * to get the tag of that pointer. * * Context: Any context. * Return: A tag. */ static inline unsigned int xa_pointer_tag(void *entry) { return (unsigned long)entry & 3UL; } /* * xa_mk_internal() - Create an internal entry. * @v: Value to turn into an internal entry. * * Internal entries are used for a number of purposes. Entries 0-255 are * used for sibling entries (only 0-62 are used by the current code). 256 * is used for the retry entry. 257 is used for the reserved / zero entry. * Negative internal entries are used to represent errnos. Node pointers * are also tagged as internal entries in some situations. * * Context: Any context. * Return: An XArray internal entry corresponding to this value. */ static inline void *xa_mk_internal(unsigned long v) { return (void *)((v << 2) | 2); } /* * xa_to_internal() - Extract the value from an internal entry. * @entry: XArray entry. * * Context: Any context. * Return: The value which was stored in the internal entry. */ static inline unsigned long xa_to_internal(const void *entry) { return (unsigned long)entry >> 2; } /* * xa_is_internal() - Is the entry an internal entry? * @entry: XArray entry. * * Context: Any context. * Return: %true if the entry is an internal entry. */ static inline bool xa_is_internal(const void *entry) { return ((unsigned long)entry & 3) == 2; } #define XA_ZERO_ENTRY xa_mk_internal(257) /** * xa_is_zero() - Is the entry a zero entry? * @entry: Entry retrieved from the XArray * * The normal API will return NULL as the contents of a slot containing * a zero entry. You can only see zero entries by using the advanced API. * * Return: %true if the entry is a zero entry. */ static inline bool xa_is_zero(const void *entry) { return unlikely(entry == XA_ZERO_ENTRY); } /** * xa_is_err() - Report whether an XArray operation returned an error * @entry: Result from calling an XArray function * * If an XArray operation cannot complete an operation, it will return * a special value indicating an error. This function tells you * whether an error occurred; xa_err() tells you which error occurred. * * Context: Any context. * Return: %true if the entry indicates an error. */ static inline bool xa_is_err(const void *entry) { return unlikely(xa_is_internal(entry) && entry >= xa_mk_internal(-MAX_ERRNO)); } /** * xa_err() - Turn an XArray result into an errno. * @entry: Result from calling an XArray function. * * If an XArray operation cannot complete an operation, it will return * a special pointer value which encodes an errno. This function extracts * the errno from the pointer value, or returns 0 if the pointer does not * represent an errno. * * Context: Any context. * Return: A negative errno or 0. */ static inline int xa_err(void *entry) { /* xa_to_internal() would not do sign extension. */ if (xa_is_err(entry)) return (long)entry >> 2; return 0; } /** * struct xa_limit - Represents a range of IDs. * @min: The lowest ID to allocate (inclusive). * @max: The maximum ID to allocate (inclusive). * * This structure is used either directly or via the XA_LIMIT() macro * to communicate the range of IDs that are valid for allocation. * Three common ranges are predefined for you: * * xa_limit_32b - [0 - UINT_MAX] * * xa_limit_31b - [0 - INT_MAX] * * xa_limit_16b - [0 - USHRT_MAX] */ struct xa_limit { u32 max; u32 min; }; #define XA_LIMIT(_min, _max) (struct xa_limit) { .min = _min, .max = _max } #define xa_limit_32b XA_LIMIT(0, UINT_MAX) #define xa_limit_31b XA_LIMIT(0, INT_MAX) #define xa_limit_16b XA_LIMIT(0, USHRT_MAX) typedef unsigned __bitwise xa_mark_t; #define XA_MARK_0 ((__force xa_mark_t)0U) #define XA_MARK_1 ((__force xa_mark_t)1U) #define XA_MARK_2 ((__force xa_mark_t)2U) #define XA_PRESENT ((__force xa_mark_t)8U) #define XA_MARK_MAX XA_MARK_2 #define XA_FREE_MARK XA_MARK_0 enum xa_lock_type { XA_LOCK_IRQ = 1, XA_LOCK_BH = 2, }; /* * Values for xa_flags. The radix tree stores its GFP flags in the xa_flags, * and we remain compatible with that. */ #define XA_FLAGS_LOCK_IRQ ((__force gfp_t)XA_LOCK_IRQ) #define XA_FLAGS_LOCK_BH ((__force gfp_t)XA_LOCK_BH) #define XA_FLAGS_TRACK_FREE ((__force gfp_t)4U) #define XA_FLAGS_ZERO_BUSY ((__force gfp_t)8U) #define XA_FLAGS_ALLOC_WRAPPED ((__force gfp_t)16U) #define XA_FLAGS_ACCOUNT ((__force gfp_t)32U) #define XA_FLAGS_MARK(mark) ((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \ (__force unsigned)(mark))) /* ALLOC is for a normal 0-based alloc. ALLOC1 is for an 1-based alloc */ #define XA_FLAGS_ALLOC (XA_FLAGS_TRACK_FREE | XA_FLAGS_MARK(XA_FREE_MARK)) #define XA_FLAGS_ALLOC1 (XA_FLAGS_TRACK_FREE | XA_FLAGS_ZERO_BUSY) /** * struct xarray - The anchor of the XArray. * @xa_lock: Lock that protects the contents of the XArray. * * To use the xarray, define it statically or embed it in your data structure. * It is a very small data structure, so it does not usually make sense to * allocate it separately and keep a pointer to it in your data structure. * * You may use the xa_lock to protect your own data structures as well. */ /* * If all of the entries in the array are NULL, @xa_head is a NULL pointer. * If the only non-NULL entry in the array is at index 0, @xa_head is that * entry. If any other entry in the array is non-NULL, @xa_head points * to an @xa_node. */ struct xarray { spinlock_t xa_lock; /* private: The rest of the data structure is not to be used directly. */ gfp_t xa_flags; void __rcu * xa_head; }; #define XARRAY_INIT(name, flags) { \ .xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock), \ .xa_flags = flags, \ .xa_head = NULL, \ } /** * DEFINE_XARRAY_FLAGS() - Define an XArray with custom flags. * @name: A string that names your XArray. * @flags: XA_FLAG values. * * This is intended for file scope definitions of XArrays. It declares * and initialises an empty XArray with the chosen name and flags. It is * equivalent to calling xa_init_flags() on the array, but it does the * initialisation at compiletime instead of runtime. */ #define DEFINE_XARRAY_FLAGS(name, flags) \ struct xarray name = XARRAY_INIT(name, flags) /** * DEFINE_XARRAY() - Define an XArray. * @name: A string that names your XArray. * * This is intended for file scope definitions of XArrays. It declares * and initialises an empty XArray with the chosen name. It is equivalent * to calling xa_init() on the array, but it does the initialisation at * compiletime instead of runtime. */ #define DEFINE_XARRAY(name) DEFINE_XARRAY_FLAGS(name, 0) /** * DEFINE_XARRAY_ALLOC() - Define an XArray which allocates IDs starting at 0. * @name: A string that names your XArray. * * This is intended for file scope definitions of allocating XArrays. * See also DEFINE_XARRAY(). */ #define DEFINE_XARRAY_ALLOC(name) DEFINE_XARRAY_FLAGS(name, XA_FLAGS_ALLOC) /** * DEFINE_XARRAY_ALLOC1() - Define an XArray which allocates IDs starting at 1. * @name: A string that names your XArray. * * This is intended for file scope definitions of allocating XArrays. * See also DEFINE_XARRAY(). */ #define DEFINE_XARRAY_ALLOC1(name) DEFINE_XARRAY_FLAGS(name, XA_FLAGS_ALLOC1) void *xa_load(struct xarray *, unsigned long index); void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t); void *xa_erase(struct xarray *, unsigned long index); void *xa_store_range(struct xarray *, unsigned long first, unsigned long last, void *entry, gfp_t); bool xa_get_mark(struct xarray *, unsigned long index, xa_mark_t); void xa_set_mark(struct xarray *, unsigned long index, xa_mark_t); void xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t); void *xa_find(struct xarray *xa, unsigned long *index, unsigned long max, xa_mark_t) __attribute__((nonnull(2))); void *xa_find_after(struct xarray *xa, unsigned long *index, unsigned long max, xa_mark_t) __attribute__((nonnull(2))); unsigned int xa_extract(struct xarray *, void **dst, unsigned long start, unsigned long max, unsigned int n, xa_mark_t); void xa_destroy(struct xarray *); /** * xa_init_flags() - Initialise an empty XArray with flags. * @xa: XArray. * @flags: XA_FLAG values. * * If you need to initialise an XArray with special flags (eg you need * to take the lock from interrupt context), use this function instead * of xa_init(). * * Context: Any context. */ static inline void xa_init_flags(struct xarray *xa, gfp_t flags) { spin_lock_init(&xa->xa_lock); xa->xa_flags = flags; xa->xa_head = NULL; } /** * xa_init() - Initialise an empty XArray. * @xa: XArray. * * An empty XArray is full of NULL entries. * * Context: Any context. */ static inline void xa_init(struct xarray *xa) { xa_init_flags(xa, 0); } /** * xa_empty() - Determine if an array has any present entries. * @xa: XArray. * * Context: Any context. * Return: %true if the array contains only NULL pointers. */ static inline bool xa_empty(const struct xarray *xa) { return xa->xa_head == NULL; } /** * xa_marked() - Inquire whether any entry in this array has a mark set * @xa: Array * @mark: Mark value * * Context: Any context. * Return: %true if any entry has this mark set. */ static inline bool xa_marked(const struct xarray *xa, xa_mark_t mark) { return xa->xa_flags & XA_FLAGS_MARK(mark); } /** * xa_for_each_range() - Iterate over a portion of an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * @start: First index to retrieve from array. * @last: Last index to retrieve from array. * * During the iteration, @entry will have the value of the entry stored * in @xa at @index. You may modify @index during the iteration if you * want to skip or reprocess indices. It is safe to modify the array * during the iteration. At the end of the iteration, @entry will be set * to NULL and @index will have a value less than or equal to max. * * xa_for_each_range() is O(n.log(n)) while xas_for_each() is O(n). You have * to handle your own locking with xas_for_each(), and if you have to unlock * after each iteration, it will also end up being O(n.log(n)). * xa_for_each_range() will spin if it hits a retry entry; if you intend to * see retry entries, you should use the xas_for_each() iterator instead. * The xas_for_each() iterator will expand into more inline code than * xa_for_each_range(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each_range(xa, index, entry, start, last) \ for (index = start, \ entry = xa_find(xa, &index, last, XA_PRESENT); \ entry; \ entry = xa_find_after(xa, &index, last, XA_PRESENT)) /** * xa_for_each_start() - Iterate over a portion of an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * @start: First index to retrieve from array. * * During the iteration, @entry will have the value of the entry stored * in @xa at @index. You may modify @index during the iteration if you * want to skip or reprocess indices. It is safe to modify the array * during the iteration. At the end of the iteration, @entry will be set * to NULL and @index will have a value less than or equal to max. * * xa_for_each_start() is O(n.log(n)) while xas_for_each() is O(n). You have * to handle your own locking with xas_for_each(), and if you have to unlock * after each iteration, it will also end up being O(n.log(n)). * xa_for_each_start() will spin if it hits a retry entry; if you intend to * see retry entries, you should use the xas_for_each() iterator instead. * The xas_for_each() iterator will expand into more inline code than * xa_for_each_start(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each_start(xa, index, entry, start) \ xa_for_each_range(xa, index, entry, start, ULONG_MAX) /** * xa_for_each() - Iterate over present entries in an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * * During the iteration, @entry will have the value of the entry stored * in @xa at @index. You may modify @index during the iteration if you want * to skip or reprocess indices. It is safe to modify the array during the * iteration. At the end of the iteration, @entry will be set to NULL and * @index will have a value less than or equal to max. * * xa_for_each() is O(n.log(n)) while xas_for_each() is O(n). You have * to handle your own locking with xas_for_each(), and if you have to unlock * after each iteration, it will also end up being O(n.log(n)). xa_for_each() * will spin if it hits a retry entry; if you intend to see retry entries, * you should use the xas_for_each() iterator instead. The xas_for_each() * iterator will expand into more inline code than xa_for_each(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each(xa, index, entry) \ xa_for_each_start(xa, index, entry, 0) /** * xa_for_each_marked() - Iterate over marked entries in an XArray. * @xa: XArray. * @index: Index of @entry. * @entry: Entry retrieved from array. * @filter: Selection criterion. * * During the iteration, @entry will have the value of the entry stored * in @xa at @index. The iteration will skip all entries in the array * which do not match @filter. You may modify @index during the iteration * if you want to skip or reprocess indices. It is safe to modify the array * during the iteration. At the end of the iteration, @entry will be set to * NULL and @index will have a value less than or equal to max. * * xa_for_each_marked() is O(n.log(n)) while xas_for_each_marked() is O(n). * You have to handle your own locking with xas_for_each(), and if you have * to unlock after each iteration, it will also end up being O(n.log(n)). * xa_for_each_marked() will spin if it hits a retry entry; if you intend to * see retry entries, you should use the xas_for_each_marked() iterator * instead. The xas_for_each_marked() iterator will expand into more inline * code than xa_for_each_marked(). * * Context: Any context. Takes and releases the RCU lock. */ #define xa_for_each_marked(xa, index, entry, filter) \ for (index = 0, entry = xa_find(xa, &index, ULONG_MAX, filter); \ entry; entry = xa_find_after(xa, &index, ULONG_MAX, filter)) #define xa_trylock(xa) spin_trylock(&(xa)->xa_lock) #define xa_lock(xa) spin_lock(&(xa)->xa_lock) #define xa_unlock(xa) spin_unlock(&(xa)->xa_lock) #define xa_lock_bh(xa) spin_lock_bh(&(xa)->xa_lock) #define xa_unlock_bh(xa) spin_unlock_bh(&(xa)->xa_lock) #define xa_lock_irq(xa) spin_lock_irq(&(xa)->xa_lock) #define xa_unlock_irq(xa) spin_unlock_irq(&(xa)->xa_lock) #define xa_lock_irqsave(xa, flags) \ spin_lock_irqsave(&(xa)->xa_lock, flags) #define xa_unlock_irqrestore(xa, flags) \ spin_unlock_irqrestore(&(xa)->xa_lock, flags) #define xa_lock_nested(xa, subclass) \ spin_lock_nested(&(xa)->xa_lock, subclass) #define xa_lock_bh_nested(xa, subclass) \ spin_lock_bh_nested(&(xa)->xa_lock, subclass) #define xa_lock_irq_nested(xa, subclass) \ spin_lock_irq_nested(&(xa)->xa_lock, subclass) #define xa_lock_irqsave_nested(xa, flags, subclass) \ spin_lock_irqsave_nested(&(xa)->xa_lock, flags, subclass) /* * Versions of the normal API which require the caller to hold the * xa_lock. If the GFP flags allow it, they will drop the lock to * allocate memory, then reacquire it afterwards. These functions * may also re-enable interrupts if the XArray flags indicate the * locking should be interrupt safe. */ void *__xa_erase(struct xarray *, unsigned long index); void *__xa_store(struct xarray *, unsigned long index, void *entry, gfp_t); void *__xa_cmpxchg(struct xarray *, unsigned long index, void *old, void *entry, gfp_t); int __must_check __xa_insert(struct xarray *, unsigned long index, void *entry, gfp_t); int __must_check __xa_alloc(struct xarray *, u32 *id, void *entry, struct xa_limit, gfp_t); int __must_check __xa_alloc_cyclic(struct xarray *, u32 *id, void *entry, struct xa_limit, u32 *next, gfp_t); void __xa_set_mark(struct xarray *, unsigned long index, xa_mark_t); void __xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t); /** * xa_store_bh() - Store this entry in the XArray. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * This function is like calling xa_store() except it disables softirqs * while holding the array lock. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. * Return: The old entry at this index or xa_err() if an error happened. */ static inline void *xa_store_bh(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_bh(xa); curr = __xa_store(xa, index, entry, gfp); xa_unlock_bh(xa); return curr; } /** * xa_store_irq() - Store this entry in the XArray. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * This function is like calling xa_store() except it disables interrupts * while holding the array lock. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. * Return: The old entry at this index or xa_err() if an error happened. */ static inline void *xa_store_irq(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_irq(xa); curr = __xa_store(xa, index, entry, gfp); xa_unlock_irq(xa); return curr; } /** * xa_erase_bh() - Erase this entry from the XArray. * @xa: XArray. * @index: Index of entry. * * After this function returns, loading from @index will return %NULL. * If the index is part of a multi-index entry, all indices will be erased * and none of the entries will be part of a multi-index entry. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. * Return: The entry which used to be at this index. */ static inline void *xa_erase_bh(struct xarray *xa, unsigned long index) { void *entry; xa_lock_bh(xa); entry = __xa_erase(xa, index); xa_unlock_bh(xa); return entry; } /** * xa_erase_irq() - Erase this entry from the XArray. * @xa: XArray. * @index: Index of entry. * * After this function returns, loading from @index will return %NULL. * If the index is part of a multi-index entry, all indices will be erased * and none of the entries will be part of a multi-index entry. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. * Return: The entry which used to be at this index. */ static inline void *xa_erase_irq(struct xarray *xa, unsigned long index) { void *entry; xa_lock_irq(xa); entry = __xa_erase(xa, index); xa_unlock_irq(xa); return entry; } /** * xa_cmpxchg() - Conditionally replace an entry in the XArray. * @xa: XArray. * @index: Index into array. * @old: Old value to test against. * @entry: New value to place in array. * @gfp: Memory allocation flags. * * If the entry at @index is the same as @old, replace it with @entry. * If the return value is equal to @old, then the exchange was successful. * * Context: Any context. Takes and releases the xa_lock. May sleep * if the @gfp flags permit. * Return: The old value at this index or xa_err() if an error happened. */ static inline void *xa_cmpxchg(struct xarray *xa, unsigned long index, void *old, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock(xa); curr = __xa_cmpxchg(xa, index, old, entry, gfp); xa_unlock(xa); return curr; } /** * xa_cmpxchg_bh() - Conditionally replace an entry in the XArray. * @xa: XArray. * @index: Index into array. * @old: Old value to test against. * @entry: New value to place in array. * @gfp: Memory allocation flags. * * This function is like calling xa_cmpxchg() except it disables softirqs * while holding the array lock. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: The old value at this index or xa_err() if an error happened. */ static inline void *xa_cmpxchg_bh(struct xarray *xa, unsigned long index, void *old, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_bh(xa); curr = __xa_cmpxchg(xa, index, old, entry, gfp); xa_unlock_bh(xa); return curr; } /** * xa_cmpxchg_irq() - Conditionally replace an entry in the XArray. * @xa: XArray. * @index: Index into array. * @old: Old value to test against. * @entry: New value to place in array. * @gfp: Memory allocation flags. * * This function is like calling xa_cmpxchg() except it disables interrupts * while holding the array lock. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: The old value at this index or xa_err() if an error happened. */ static inline void *xa_cmpxchg_irq(struct xarray *xa, unsigned long index, void *old, void *entry, gfp_t gfp) { void *curr; might_alloc(gfp); xa_lock_irq(xa); curr = __xa_cmpxchg(xa, index, old, entry, gfp); xa_unlock_irq(xa); return curr; } /** * xa_insert() - Store this entry in the XArray unless another entry is * already present. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * Inserting a NULL entry will store a reserved entry (like xa_reserve()) * if no entry is present. Inserting will fail if a reserved entry is * present, even though loading from this index will return NULL. * * Context: Any context. Takes and releases the xa_lock. May sleep if * the @gfp flags permit. * Return: 0 if the store succeeded. -EBUSY if another entry was present. * -ENOMEM if memory could not be allocated. */ static inline int __must_check xa_insert(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { int err; might_alloc(gfp); xa_lock(xa); err = __xa_insert(xa, index, entry, gfp); xa_unlock(xa); return err; } /** * xa_insert_bh() - Store this entry in the XArray unless another entry is * already present. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * Inserting a NULL entry will store a reserved entry (like xa_reserve()) * if no entry is present. Inserting will fail if a reserved entry is * present, even though loading from this index will return NULL. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: 0 if the store succeeded. -EBUSY if another entry was present. * -ENOMEM if memory could not be allocated. */ static inline int __must_check xa_insert_bh(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_bh(xa); err = __xa_insert(xa, index, entry, gfp); xa_unlock_bh(xa); return err; } /** * xa_insert_irq() - Store this entry in the XArray unless another entry is * already present. * @xa: XArray. * @index: Index into array. * @entry: New entry. * @gfp: Memory allocation flags. * * Inserting a NULL entry will store a reserved entry (like xa_reserve()) * if no entry is present. Inserting will fail if a reserved entry is * present, even though loading from this index will return NULL. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: 0 if the store succeeded. -EBUSY if another entry was present. * -ENOMEM if memory could not be allocated. */ static inline int __must_check xa_insert_irq(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_irq(xa); err = __xa_insert(xa, index, entry, gfp); xa_unlock_irq(xa); return err; } /** * xa_alloc() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Context: Any context. Takes and releases the xa_lock. May sleep if * the @gfp flags permit. * Return: 0 on success, -ENOMEM if memory could not be allocated or * -EBUSY if there are no free entries in @limit. */ static inline __must_check int xa_alloc(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, gfp_t gfp) { int err; might_alloc(gfp); xa_lock(xa); err = __xa_alloc(xa, id, entry, limit, gfp); xa_unlock(xa); return err; } /** * xa_alloc_bh() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: 0 on success, -ENOMEM if memory could not be allocated or * -EBUSY if there are no free entries in @limit. */ static inline int __must_check xa_alloc_bh(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_bh(xa); err = __xa_alloc(xa, id, entry, limit, gfp); xa_unlock_bh(xa); return err; } /** * xa_alloc_irq() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: 0 on success, -ENOMEM if memory could not be allocated or * -EBUSY if there are no free entries in @limit. */ static inline int __must_check xa_alloc_irq(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_irq(xa); err = __xa_alloc(xa, id, entry, limit, gfp); xa_unlock_irq(xa); return err; } /** * xa_alloc_cyclic() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of allocated ID. * @next: Pointer to next ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * The search for an empty entry will start at @next and will wrap * around if necessary. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Note that callers interested in whether wrapping has occurred should * use __xa_alloc_cyclic() instead. * * Context: Any context. Takes and releases the xa_lock. May sleep if * the @gfp flags permit. * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, u32 *next, gfp_t gfp) { int err; might_alloc(gfp); xa_lock(xa); err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock(xa); return err < 0 ? err : 0; } /** * xa_alloc_cyclic_bh() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of allocated ID. * @next: Pointer to next ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * The search for an empty entry will start at @next and will wrap * around if necessary. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Note that callers interested in whether wrapping has occurred should * use __xa_alloc_cyclic() instead. * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic_bh(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, u32 *next, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_bh(xa); err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock_bh(xa); return err < 0 ? err : 0; } /** * xa_alloc_cyclic_irq() - Find somewhere to store this entry in the XArray. * @xa: XArray. * @id: Pointer to ID. * @entry: New entry. * @limit: Range of allocated ID. * @next: Pointer to next ID to allocate. * @gfp: Memory allocation flags. * * Finds an empty entry in @xa between @limit.min and @limit.max, * stores the index into the @id pointer, then stores the entry at * that index. A concurrent lookup will not see an uninitialised @id. * The search for an empty entry will start at @next and will wrap * around if necessary. * * Must only be operated on an xarray initialized with flag XA_FLAGS_ALLOC set * in xa_init_flags(). * * Note that callers interested in whether wrapping has occurred should * use __xa_alloc_cyclic() instead. * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. May sleep if the @gfp flags permit. * Return: 0 if the allocation succeeded, -ENOMEM if memory could not be * allocated or -EBUSY if there are no free entries in @limit. */ static inline int xa_alloc_cyclic_irq(struct xarray *xa, u32 *id, void *entry, struct xa_limit limit, u32 *next, gfp_t gfp) { int err; might_alloc(gfp); xa_lock_irq(xa); err = __xa_alloc_cyclic(xa, id, entry, limit, next, gfp); xa_unlock_irq(xa); return err < 0 ? err : 0; } /** * xa_reserve() - Reserve this index in the XArray. * @xa: XArray. * @index: Index into array. * @gfp: Memory allocation flags. * * Ensures there is somewhere to store an entry at @index in the array. * If there is already something stored at @index, this function does * nothing. If there was nothing there, the entry is marked as reserved. * Loading from a reserved entry returns a %NULL pointer. * * If you do not use the entry that you have reserved, call xa_release() * or xa_erase() to free any unnecessary memory. * * Context: Any context. Takes and releases the xa_lock. * May sleep if the @gfp flags permit. * Return: 0 if the reservation succeeded or -ENOMEM if it failed. */ static inline __must_check int xa_reserve(struct xarray *xa, unsigned long index, gfp_t gfp) { return xa_err(xa_cmpxchg(xa, index, NULL, XA_ZERO_ENTRY, gfp)); } /** * xa_reserve_bh() - Reserve this index in the XArray. * @xa: XArray. * @index: Index into array. * @gfp: Memory allocation flags. * * A softirq-disabling version of xa_reserve(). * * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. * Return: 0 if the reservation succeeded or -ENOMEM if it failed. */ static inline __must_check int xa_reserve_bh(struct xarray *xa, unsigned long index, gfp_t gfp) { return xa_err(xa_cmpxchg_bh(xa, index, NULL, XA_ZERO_ENTRY, gfp)); } /** * xa_reserve_irq() - Reserve this index in the XArray. * @xa: XArray. * @index: Index into array. * @gfp: Memory allocation flags. * * An interrupt-disabling version of xa_reserve(). * * Context: Process context. Takes and releases the xa_lock while * disabling interrupts. * Return: 0 if the reservation succeeded or -ENOMEM if it failed. */ static inline __must_check int xa_reserve_irq(struct xarray *xa, unsigned long index, gfp_t gfp) { return xa_err(xa_cmpxchg_irq(xa, index, NULL, XA_ZERO_ENTRY, gfp)); } /** * xa_release() - Release a reserved entry. * @xa: XArray. * @index: Index of entry. * * After calling xa_reserve(), you can call this function to release the * reservation. If the entry at @index has been stored to, this function * will do nothing. */ static inline void xa_release(struct xarray *xa, unsigned long index) { xa_cmpxchg(xa, index, XA_ZERO_ENTRY, NULL, 0); } /* Everything below here is the Advanced API. Proceed with caution. */ /* * The xarray is constructed out of a set of 'chunks' of pointers. Choosing * the best chunk size requires some tradeoffs. A power of two recommends * itself so that we can walk the tree based purely on shifts and masks. * Generally, the larger the better; as the number of slots per level of the * tree increases, the less tall the tree needs to be. But that needs to be * balanced against the memory consumption of each node. On a 64-bit system, * xa_node is currently 576 bytes, and we get 7 of them per 4kB page. If we * doubled the number of slots per node, we'd get only 3 nodes per 4kB page. */ #ifndef XA_CHUNK_SHIFT #define XA_CHUNK_SHIFT (IS_ENABLED(CONFIG_BASE_SMALL) ? 4 : 6) #endif #define XA_CHUNK_SIZE (1UL << XA_CHUNK_SHIFT) #define XA_CHUNK_MASK (XA_CHUNK_SIZE - 1) #define XA_MAX_MARKS 3 #define XA_MARK_LONGS BITS_TO_LONGS(XA_CHUNK_SIZE) /* * @count is the count of every non-NULL element in the ->slots array * whether that is a value entry, a retry entry, a user pointer, * a sibling entry or a pointer to the next level of the tree. * @nr_values is the count of every element in ->slots which is * either a value entry or a sibling of a value entry. */ struct xa_node { unsigned char shift; /* Bits remaining in each slot */ unsigned char offset; /* Slot offset in parent */ unsigned char count; /* Total entry count */ unsigned char nr_values; /* Value entry count */ struct xa_node __rcu *parent; /* NULL at top of tree */ struct xarray *array; /* The array we belong to */ union { struct list_head private_list; /* For tree user */ struct rcu_head rcu_head; /* Used when freeing node */ }; void __rcu *slots[XA_CHUNK_SIZE]; union { unsigned long tags[XA_MAX_MARKS][XA_MARK_LONGS]; unsigned long marks[XA_MAX_MARKS][XA_MARK_LONGS]; }; }; void xa_dump(const struct xarray *); void xa_dump_node(const struct xa_node *); #ifdef XA_DEBUG #define XA_BUG_ON(xa, x) do { \ if (x) { \ xa_dump(xa); \ BUG(); \ } \ } while (0) #define XA_NODE_BUG_ON(node, x) do { \ if (x) { \ if (node) xa_dump_node(node); \ BUG(); \ } \ } while (0) #else #define XA_BUG_ON(xa, x) do { } while (0) #define XA_NODE_BUG_ON(node, x) do { } while (0) #endif /* Private */ static inline void *xa_head(const struct xarray *xa) { return rcu_dereference_check(xa->xa_head, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_head_locked(const struct xarray *xa) { return rcu_dereference_protected(xa->xa_head, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_entry(const struct xarray *xa, const struct xa_node *node, unsigned int offset) { XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE); return rcu_dereference_check(node->slots[offset], lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_entry_locked(const struct xarray *xa, const struct xa_node *node, unsigned int offset) { XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE); return rcu_dereference_protected(node->slots[offset], lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline struct xa_node *xa_parent(const struct xarray *xa, const struct xa_node *node) { return rcu_dereference_check(node->parent, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline struct xa_node *xa_parent_locked(const struct xarray *xa, const struct xa_node *node) { return rcu_dereference_protected(node->parent, lockdep_is_held(&xa->xa_lock)); } /* Private */ static inline void *xa_mk_node(const struct xa_node *node) { return (void *)((unsigned long)node | 2); } /* Private */ static inline struct xa_node *xa_to_node(const void *entry) { return (struct xa_node *)((unsigned long)entry - 2); } /* Private */ static inline bool xa_is_node(const void *entry) { return xa_is_internal(entry) && (unsigned long)entry > 4096; } /* Private */ static inline void *xa_mk_sibling(unsigned int offset) { return xa_mk_internal(offset); } /* Private */ static inline unsigned long xa_to_sibling(const void *entry) { return xa_to_internal(entry); } /** * xa_is_sibling() - Is the entry a sibling entry? * @entry: Entry retrieved from the XArray * * Return: %true if the entry is a sibling entry. */ static inline bool xa_is_sibling(const void *entry) { return IS_ENABLED(CONFIG_XARRAY_MULTI) && xa_is_internal(entry) && (entry < xa_mk_sibling(XA_CHUNK_SIZE - 1)); } #define XA_RETRY_ENTRY xa_mk_internal(256) /** * xa_is_retry() - Is the entry a retry entry? * @entry: Entry retrieved from the XArray * * Return: %true if the entry is a retry entry. */ static inline bool xa_is_retry(const void *entry) { return unlikely(entry == XA_RETRY_ENTRY); } /** * xa_is_advanced() - Is the entry only permitted for the advanced API? * @entry: Entry to be stored in the XArray. * * Return: %true if the entry cannot be stored by the normal API. */ static inline bool xa_is_advanced(const void *entry) { return xa_is_internal(entry) && (entry <= XA_RETRY_ENTRY); } /** * typedef xa_update_node_t - A callback function from the XArray. * @node: The node which is being processed * * This function is called every time the XArray updates the count of * present and value entries in a node. It allows advanced users to * maintain the private_list in the node. * * Context: The xa_lock is held and interrupts may be disabled. * Implementations should not drop the xa_lock, nor re-enable * interrupts. */ typedef void (*xa_update_node_t)(struct xa_node *node); void xa_delete_node(struct xa_node *, xa_update_node_t); /* * The xa_state is opaque to its users. It contains various different pieces * of state involved in the current operation on the XArray. It should be * declared on the stack and passed between the various internal routines. * The various elements in it should not be accessed directly, but only * through the provided accessor functions. The below documentation is for * the benefit of those working on the code, not for users of the XArray. * * @xa_node usually points to the xa_node containing the slot we're operating * on (and @xa_offset is the offset in the slots array). If there is a * single entry in the array at index 0, there are no allocated xa_nodes to * point to, and so we store %NULL in @xa_node. @xa_node is set to * the value %XAS_RESTART if the xa_state is not walked to the correct * position in the tree of nodes for this operation. If an error occurs * during an operation, it is set to an %XAS_ERROR value. If we run off the * end of the allocated nodes, it is set to %XAS_BOUNDS. */ struct xa_state { struct xarray *xa; unsigned long xa_index; unsigned char xa_shift; unsigned char xa_sibs; unsigned char xa_offset; unsigned char xa_pad; /* Helps gcc generate better code */ struct xa_node *xa_node; struct xa_node *xa_alloc; xa_update_node_t xa_update; struct list_lru *xa_lru; }; /* * We encode errnos in the xas->xa_node. If an error has happened, we need to * drop the lock to fix it, and once we've done so the xa_state is invalid. */ #define XA_ERROR(errno) ((struct xa_node *)(((unsigned long)errno << 2) | 2UL)) #define XAS_BOUNDS ((struct xa_node *)1UL) #define XAS_RESTART ((struct xa_node *)3UL) #define __XA_STATE(array, index, shift, sibs) { \ .xa = array, \ .xa_index = index, \ .xa_shift = shift, \ .xa_sibs = sibs, \ .xa_offset = 0, \ .xa_pad = 0, \ .xa_node = XAS_RESTART, \ .xa_alloc = NULL, \ .xa_update = NULL, \ .xa_lru = NULL, \ } /** * XA_STATE() - Declare an XArray operation state. * @name: Name of this operation state (usually xas). * @array: Array to operate on. * @index: Initial index of interest. * * Declare and initialise an xa_state on the stack. */ #define XA_STATE(name, array, index) \ struct xa_state name = __XA_STATE(array, index, 0, 0) /** * XA_STATE_ORDER() - Declare an XArray operation state. * @name: Name of this operation state (usually xas). * @array: Array to operate on. * @index: Initial index of interest. * @order: Order of entry. * * Declare and initialise an xa_state on the stack. This variant of * XA_STATE() allows you to specify the 'order' of the element you * want to operate on.` */ #define XA_STATE_ORDER(name, array, index, order) \ struct xa_state name = __XA_STATE(array, \ (index >> order) << order, \ order - (order % XA_CHUNK_SHIFT), \ (1U << (order % XA_CHUNK_SHIFT)) - 1) #define xas_marked(xas, mark) xa_marked((xas)->xa, (mark)) #define xas_trylock(xas) xa_trylock((xas)->xa) #define xas_lock(xas) xa_lock((xas)->xa) #define xas_unlock(xas) xa_unlock((xas)->xa) #define xas_lock_bh(xas) xa_lock_bh((xas)->xa) #define xas_unlock_bh(xas) xa_unlock_bh((xas)->xa) #define xas_lock_irq(xas) xa_lock_irq((xas)->xa) #define xas_unlock_irq(xas) xa_unlock_irq((xas)->xa) #define xas_lock_irqsave(xas, flags) \ xa_lock_irqsave((xas)->xa, flags) #define xas_unlock_irqrestore(xas, flags) \ xa_unlock_irqrestore((xas)->xa, flags) /** * xas_error() - Return an errno stored in the xa_state. * @xas: XArray operation state. * * Return: 0 if no error has been noted. A negative errno if one has. */ static inline int xas_error(const struct xa_state *xas) { return xa_err(xas->xa_node); } /** * xas_set_err() - Note an error in the xa_state. * @xas: XArray operation state. * @err: Negative error number. * * Only call this function with a negative @err; zero or positive errors * will probably not behave the way you think they should. If you want * to clear the error from an xa_state, use xas_reset(). */ static inline void xas_set_err(struct xa_state *xas, long err) { xas->xa_node = XA_ERROR(err); } /** * xas_invalid() - Is the xas in a retry or error state? * @xas: XArray operation state. * * Return: %true if the xas cannot be used for operations. */ static inline bool xas_invalid(const struct xa_state *xas) { return (unsigned long)xas->xa_node & 3; } /** * xas_valid() - Is the xas a valid cursor into the array? * @xas: XArray operation state. * * Return: %true if the xas can be used for operations. */ static inline bool xas_valid(const struct xa_state *xas) { return !xas_invalid(xas); } /** * xas_is_node() - Does the xas point to a node? * @xas: XArray operation state. * * Return: %true if the xas currently references a node. */ static inline bool xas_is_node(const struct xa_state *xas) { return xas_valid(xas) && xas->xa_node; } /* True if the pointer is something other than a node */ static inline bool xas_not_node(struct xa_node *node) { return ((unsigned long)node & 3) || !node; } /* True if the node represents RESTART or an error */ static inline bool xas_frozen(struct xa_node *node) { return (unsigned long)node & 2; } /* True if the node represents head-of-tree, RESTART or BOUNDS */ static inline bool xas_top(struct xa_node *node) { return node <= XAS_RESTART; } /** * xas_reset() - Reset an XArray operation state. * @xas: XArray operation state. * * Resets the error or walk state of the @xas so future walks of the * array will start from the root. Use this if you have dropped the * xarray lock and want to reuse the xa_state. * * Context: Any context. */ static inline void xas_reset(struct xa_state *xas) { xas->xa_node = XAS_RESTART; } /** * xas_retry() - Retry the operation if appropriate. * @xas: XArray operation state. * @entry: Entry from xarray. * * The advanced functions may sometimes return an internal entry, such as * a retry entry or a zero entry. This function sets up the @xas to restart * the walk from the head of the array if needed. * * Context: Any context. * Return: true if the operation needs to be retried. */ static inline bool xas_retry(struct xa_state *xas, const void *entry) { if (xa_is_zero(entry)) return true; if (!xa_is_retry(entry)) return false; xas_reset(xas); return true; } void *xas_load(struct xa_state *); void *xas_store(struct xa_state *, void *entry); void *xas_find(struct xa_state *, unsigned long max); void *xas_find_conflict(struct xa_state *); bool xas_get_mark(const struct xa_state *, xa_mark_t); void xas_set_mark(const struct xa_state *, xa_mark_t); void xas_clear_mark(const struct xa_state *, xa_mark_t); void *xas_find_marked(struct xa_state *, unsigned long max, xa_mark_t); void xas_init_marks(const struct xa_state *); bool xas_nomem(struct xa_state *, gfp_t); void xas_destroy(struct xa_state *); void xas_pause(struct xa_state *); void xas_create_range(struct xa_state *); #ifdef CONFIG_XARRAY_MULTI int xa_get_order(struct xarray *, unsigned long index); int xas_get_order(struct xa_state *xas); void xas_split(struct xa_state *, void *entry, unsigned int order); void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t); void xas_try_split(struct xa_state *xas, void *entry, unsigned int order); unsigned int xas_try_split_min_order(unsigned int order); #else static inline int xa_get_order(struct xarray *xa, unsigned long index) { return 0; } static inline int xas_get_order(struct xa_state *xas) { return 0; } static inline void xas_split(struct xa_state *xas, void *entry, unsigned int order) { xas_store(xas, entry); } static inline void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order, gfp_t gfp) { } static inline void xas_try_split(struct xa_state *xas, void *entry, unsigned int order) { } static inline unsigned int xas_try_split_min_order(unsigned int order) { return 0; } #endif /** * xas_reload() - Refetch an entry from the xarray. * @xas: XArray operation state. * * Use this function to check that a previously loaded entry still has * the same value. This is useful for the lockless pagecache lookup where * we walk the array with only the RCU lock to protect us, lock the page, * then check that the page hasn't moved since we looked it up. * * The caller guarantees that @xas is still valid. If it may be in an * error or restart state, call xas_load() instead. * * Return: The entry at this location in the xarray. */ static inline void *xas_reload(struct xa_state *xas) { struct xa_node *node = xas->xa_node; void *entry; char offset; if (!node) return xa_head(xas->xa); if (IS_ENABLED(CONFIG_XARRAY_MULTI)) { offset = (xas->xa_index >> node->shift) & XA_CHUNK_MASK; entry = xa_entry(xas->xa, node, offset); if (!xa_is_sibling(entry)) return entry; offset = xa_to_sibling(entry); } else { offset = xas->xa_offset; } return xa_entry(xas->xa, node, offset); } /** * xas_set() - Set up XArray operation state for a different index. * @xas: XArray operation state. * @index: New index into the XArray. * * Move the operation state to refer to a different index. This will * have the effect of starting a walk from the top; see xas_next() * to move to an adjacent index. */ static inline void xas_set(struct xa_state *xas, unsigned long index) { xas->xa_index = index; xas->xa_node = XAS_RESTART; } /** * xas_advance() - Skip over sibling entries. * @xas: XArray operation state. * @index: Index of last sibling entry. * * Move the operation state to refer to the last sibling entry. * This is useful for loops that normally want to see sibling * entries but sometimes want to skip them. Use xas_set() if you * want to move to an index which is not part of this entry. */ static inline void xas_advance(struct xa_state *xas, unsigned long index) { unsigned char shift = xas_is_node(xas) ? xas->xa_node->shift : 0; xas->xa_index = index; xas->xa_offset = (index >> shift) & XA_CHUNK_MASK; } /** * xas_set_order() - Set up XArray operation state for a multislot entry. * @xas: XArray operation state. * @index: Target of the operation. * @order: Entry occupies 2^@order indices. */ static inline void xas_set_order(struct xa_state *xas, unsigned long index, unsigned int order) { #ifdef CONFIG_XARRAY_MULTI xas->xa_index = order < BITS_PER_LONG ? (index >> order) << order : 0; xas->xa_shift = order - (order % XA_CHUNK_SHIFT); xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1; xas->xa_node = XAS_RESTART; #else BUG_ON(order > 0); xas_set(xas, index); #endif } /** * xas_set_update() - Set up XArray operation state for a callback. * @xas: XArray operation state. * @update: Function to call when updating a node. * * The XArray can notify a caller after it has updated an xa_node. * This is advanced functionality and is only needed by the page * cache and swap cache. */ static inline void xas_set_update(struct xa_state *xas, xa_update_node_t update) { xas->xa_update = update; } static inline void xas_set_lru(struct xa_state *xas, struct list_lru *lru) { xas->xa_lru = lru; } /** * xas_next_entry() - Advance iterator to next present entry. * @xas: XArray operation state. * @max: Highest index to return. * * xas_next_entry() is an inline function to optimise xarray traversal for * speed. It is equivalent to calling xas_find(), and will call xas_find() * for all the hard cases. * * Return: The next present entry after the one currently referred to by @xas. */ static inline void *xas_next_entry(struct xa_state *xas, unsigned long max) { struct xa_node *node = xas->xa_node; void *entry; if (unlikely(xas_not_node(node) || node->shift || xas->xa_offset != (xas->xa_index & XA_CHUNK_MASK))) return xas_find(xas, max); do { if (unlikely(xas->xa_index >= max)) return xas_find(xas, max); if (unlikely(xas->xa_offset == XA_CHUNK_MASK)) return xas_find(xas, max); entry = xa_entry(xas->xa, node, xas->xa_offset + 1); if (unlikely(xa_is_internal(entry))) return xas_find(xas, max); xas->xa_offset++; xas->xa_index++; } while (!entry); return entry; } /* Private */ static inline unsigned int xas_find_chunk(struct xa_state *xas, bool advance, xa_mark_t mark) { unsigned long *addr = xas->xa_node->marks[(__force unsigned)mark]; unsigned int offset = xas->xa_offset; if (advance) offset++; if (XA_CHUNK_SIZE == BITS_PER_LONG) { if (offset < XA_CHUNK_SIZE) { unsigned long data = *addr & (~0UL << offset); if (data) return __ffs(data); } return XA_CHUNK_SIZE; } return find_next_bit(addr, XA_CHUNK_SIZE, offset); } /** * xas_next_marked() - Advance iterator to next marked entry. * @xas: XArray operation state. * @max: Highest index to return. * @mark: Mark to search for. * * xas_next_marked() is an inline function to optimise xarray traversal for * speed. It is equivalent to calling xas_find_marked(), and will call * xas_find_marked() for all the hard cases. * * Return: The next marked entry after the one currently referred to by @xas. */ static inline void *xas_next_marked(struct xa_state *xas, unsigned long max, xa_mark_t mark) { struct xa_node *node = xas->xa_node; void *entry; unsigned int offset; if (unlikely(xas_not_node(node) || node->shift)) return xas_find_marked(xas, max, mark); offset = xas_find_chunk(xas, true, mark); xas->xa_offset = offset; xas->xa_index = (xas->xa_index & ~XA_CHUNK_MASK) + offset; if (xas->xa_index > max) return NULL; if (offset == XA_CHUNK_SIZE) return xas_find_marked(xas, max, mark); entry = xa_entry(xas->xa, node, offset); if (!entry) return xas_find_marked(xas, max, mark); return entry; } /* * If iterating while holding a lock, drop the lock and reschedule * every %XA_CHECK_SCHED loops. */ enum { XA_CHECK_SCHED = 4096, }; /** * xas_for_each() - Iterate over a range of an XArray. * @xas: XArray operation state. * @entry: Entry retrieved from the array. * @max: Maximum index to retrieve from array. * * The loop body will be executed for each entry present in the xarray * between the current xas position and @max. @entry will be set to * the entry retrieved from the xarray. It is safe to delete entries * from the array in the loop body. You should hold either the RCU lock * or the xa_lock while iterating. If you need to drop the lock, call * xas_pause() first. */ #define xas_for_each(xas, entry, max) \ for (entry = xas_find(xas, max); entry; \ entry = xas_next_entry(xas, max)) /** * xas_for_each_marked() - Iterate over a range of an XArray. * @xas: XArray operation state. * @entry: Entry retrieved from the array. * @max: Maximum index to retrieve from array. * @mark: Mark to search for. * * The loop body will be executed for each marked entry in the xarray * between the current xas position and @max. @entry will be set to * the entry retrieved from the xarray. It is safe to delete entries * from the array in the loop body. You should hold either the RCU lock * or the xa_lock while iterating. If you need to drop the lock, call * xas_pause() first. */ #define xas_for_each_marked(xas, entry, max, mark) \ for (entry = xas_find_marked(xas, max, mark); entry; \ entry = xas_next_marked(xas, max, mark)) /** * xas_for_each_conflict() - Iterate over a range of an XArray. * @xas: XArray operation state. * @entry: Entry retrieved from the array. * * The loop body will be executed for each entry in the XArray that * lies within the range specified by @xas. If the loop terminates * normally, @entry will be %NULL. The user may break out of the loop, * which will leave @entry set to the conflicting entry. The caller * may also call xa_set_err() to exit the loop while setting an error * to record the reason. */ #define xas_for_each_conflict(xas, entry) \ while ((entry = xas_find_conflict(xas))) void *__xas_next(struct xa_state *); void *__xas_prev(struct xa_state *); /** * xas_prev() - Move iterator to previous index. * @xas: XArray operation state. * * If the @xas was in an error state, it will remain in an error state * and this function will return %NULL. If the @xas has never been walked, * it will have the effect of calling xas_load(). Otherwise one will be * subtracted from the index and the state will be walked to the correct * location in the array for the next operation. * * If the iterator was referencing index 0, this function wraps * around to %ULONG_MAX. * * Return: The entry at the new index. This may be %NULL or an internal * entry. */ static inline void *xas_prev(struct xa_state *xas) { struct xa_node *node = xas->xa_node; if (unlikely(xas_not_node(node) || node->shift || xas->xa_offset == 0)) return __xas_prev(xas); xas->xa_index--; xas->xa_offset--; return xa_entry(xas->xa, node, xas->xa_offset); } /** * xas_next() - Move state to next index. * @xas: XArray operation state. * * If the @xas was in an error state, it will remain in an error state * and this function will return %NULL. If the @xas has never been walked, * it will have the effect of calling xas_load(). Otherwise one will be * added to the index and the state will be walked to the correct * location in the array for the next operation. * * If the iterator was referencing index %ULONG_MAX, this function wraps * around to 0. * * Return: The entry at the new index. This may be %NULL or an internal * entry. */ static inline void *xas_next(struct xa_state *xas) { struct xa_node *node = xas->xa_node; if (unlikely(xas_not_node(node) || node->shift || xas->xa_offset == XA_CHUNK_MASK)) return __xas_next(xas); xas->xa_index++; xas->xa_offset++; return xa_entry(xas->xa, node, xas->xa_offset); } #endif /* _LINUX_XARRAY_H */ |
| 35 7 7 7 8 7 7 7 7 7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 | /* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */ /* * linux/can/skb.h * * Definitions for the CAN network socket buffer * * Copyright (C) 2012 Oliver Hartkopp <socketcan@hartkopp.net> * */ #ifndef _CAN_SKB_H #define _CAN_SKB_H #include <linux/types.h> #include <linux/skbuff.h> #include <linux/can.h> #include <net/sock.h> void can_flush_echo_skb(struct net_device *dev); int can_put_echo_skb(struct sk_buff *skb, struct net_device *dev, unsigned int idx, unsigned int frame_len); struct sk_buff *__can_get_echo_skb(struct net_device *dev, unsigned int idx, unsigned int *len_ptr, unsigned int *frame_len_ptr); unsigned int __must_check can_get_echo_skb(struct net_device *dev, unsigned int idx, unsigned int *frame_len_ptr); void can_free_echo_skb(struct net_device *dev, unsigned int idx, unsigned int *frame_len_ptr); struct sk_buff *alloc_can_skb(struct net_device *dev, struct can_frame **cf); struct sk_buff *alloc_canfd_skb(struct net_device *dev, struct canfd_frame **cfd); struct sk_buff *alloc_canxl_skb(struct net_device *dev, struct canxl_frame **cxl, unsigned int data_len); struct sk_buff *alloc_can_err_skb(struct net_device *dev, struct can_frame **cf); bool can_dropped_invalid_skb(struct net_device *dev, struct sk_buff *skb); /* * The struct can_skb_priv is used to transport additional information along * with the stored struct can(fd)_frame that can not be contained in existing * struct sk_buff elements. * N.B. that this information must not be modified in cloned CAN sk_buffs. * To modify the CAN frame content or the struct can_skb_priv content * skb_copy() needs to be used instead of skb_clone(). */ /** * struct can_skb_priv - private additional data inside CAN sk_buffs * @ifindex: ifindex of the first interface the CAN frame appeared on * @skbcnt: atomic counter to have an unique id together with skb pointer * @frame_len: length of CAN frame in data link layer * @cf: align to the following CAN frame at skb->data */ struct can_skb_priv { int ifindex; int skbcnt; unsigned int frame_len; struct can_frame cf[]; }; static inline struct can_skb_priv *can_skb_prv(struct sk_buff *skb) { return (struct can_skb_priv *)(skb->head); } static inline void can_skb_reserve(struct sk_buff *skb) { skb_reserve(skb, sizeof(struct can_skb_priv)); } static inline void can_skb_set_owner(struct sk_buff *skb, struct sock *sk) { /* If the socket has already been closed by user space, the * refcount may already be 0 (and the socket will be freed * after the last TX skb has been freed). So only increase * socket refcount if the refcount is > 0. */ if (sk && refcount_inc_not_zero(&sk->sk_refcnt)) { skb->destructor = sock_efree; skb->sk = sk; } } /* * returns an unshared skb owned by the original sock to be echo'ed back */ static inline struct sk_buff *can_create_echo_skb(struct sk_buff *skb) { struct sk_buff *nskb; nskb = skb_clone(skb, GFP_ATOMIC); if (unlikely(!nskb)) { kfree_skb(skb); return NULL; } can_skb_set_owner(nskb, skb->sk); consume_skb(skb); return nskb; } static inline bool can_is_can_skb(const struct sk_buff *skb) { struct can_frame *cf = (struct can_frame *)skb->data; /* the CAN specific type of skb is identified by its data length */ return (skb->len == CAN_MTU && cf->len <= CAN_MAX_DLEN); } static inline bool can_is_canfd_skb(const struct sk_buff *skb) { struct canfd_frame *cfd = (struct canfd_frame *)skb->data; /* the CAN specific type of skb is identified by its data length */ return (skb->len == CANFD_MTU && cfd->len <= CANFD_MAX_DLEN); } static inline bool can_is_canxl_skb(const struct sk_buff *skb) { const struct canxl_frame *cxl = (struct canxl_frame *)skb->data; if (skb->len < CANXL_HDR_SIZE + CANXL_MIN_DLEN || skb->len > CANXL_MTU) return false; /* this also checks valid CAN XL data length boundaries */ if (skb->len != CANXL_HDR_SIZE + cxl->len) return false; return cxl->flags & CANXL_XLF; } /* get length element value from can[|fd|xl]_frame structure */ static inline unsigned int can_skb_get_len_val(struct sk_buff *skb) { const struct canxl_frame *cxl = (struct canxl_frame *)skb->data; const struct canfd_frame *cfd = (struct canfd_frame *)skb->data; if (can_is_canxl_skb(skb)) return cxl->len; return cfd->len; } /* get needed data length inside CAN frame for all frame types (RTR aware) */ static inline unsigned int can_skb_get_data_len(struct sk_buff *skb) { unsigned int len = can_skb_get_len_val(skb); const struct can_frame *cf = (struct can_frame *)skb->data; /* RTR frames have an actual length of zero */ if (can_is_can_skb(skb) && cf->can_id & CAN_RTR_FLAG) return 0; return len; } #endif /* !_CAN_SKB_H */ |
| 1 1 1 1 1 28 28 29 31 31 31 31 31 13 37 13 13 13 28 2 1 3 2 2 29 28 1 28 29 28 28 28 29 28 29 10 10 10 10 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 | /* FUSE: Filesystem in Userspace Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu> This program can be distributed under the terms of the GNU GPL. See the file COPYING. */ #include "fuse_i.h" #include "fuse_dev_i.h" #include "dev_uring_i.h" #include <linux/dax.h> #include <linux/pagemap.h> #include <linux/slab.h> #include <linux/file.h> #include <linux/seq_file.h> #include <linux/init.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/fs_context.h> #include <linux/fs_parser.h> #include <linux/statfs.h> #include <linux/random.h> #include <linux/sched.h> #include <linux/exportfs.h> #include <linux/posix_acl.h> #include <linux/pid_namespace.h> #include <uapi/linux/magic.h> MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>"); MODULE_DESCRIPTION("Filesystem in Userspace"); MODULE_LICENSE("GPL"); static struct kmem_cache *fuse_inode_cachep; struct list_head fuse_conn_list; DEFINE_MUTEX(fuse_mutex); DECLARE_WAIT_QUEUE_HEAD(fuse_dev_waitq); static int set_global_limit(const char *val, const struct kernel_param *kp); unsigned int fuse_max_pages_limit = 256; /* default is no timeout */ unsigned int fuse_default_req_timeout; unsigned int fuse_max_req_timeout; unsigned int max_user_bgreq; module_param_call(max_user_bgreq, set_global_limit, param_get_uint, &max_user_bgreq, 0644); __MODULE_PARM_TYPE(max_user_bgreq, "uint"); MODULE_PARM_DESC(max_user_bgreq, "Global limit for the maximum number of backgrounded requests an " "unprivileged user can set"); unsigned int max_user_congthresh; module_param_call(max_user_congthresh, set_global_limit, param_get_uint, &max_user_congthresh, 0644); __MODULE_PARM_TYPE(max_user_congthresh, "uint"); MODULE_PARM_DESC(max_user_congthresh, "Global limit for the maximum congestion threshold an " "unprivileged user can set"); #define FUSE_DEFAULT_BLKSIZE 512 /** Maximum number of outstanding background requests */ #define FUSE_DEFAULT_MAX_BACKGROUND 12 /** Congestion starts at 75% of maximum */ #define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4) #ifdef CONFIG_BLOCK static struct file_system_type fuseblk_fs_type; #endif struct fuse_forget_link *fuse_alloc_forget(void) { return kzalloc(sizeof(struct fuse_forget_link), GFP_KERNEL_ACCOUNT); } static struct fuse_submount_lookup *fuse_alloc_submount_lookup(void) { struct fuse_submount_lookup *sl; sl = kzalloc(sizeof(struct fuse_submount_lookup), GFP_KERNEL_ACCOUNT); if (!sl) return NULL; sl->forget = fuse_alloc_forget(); if (!sl->forget) goto out_free; return sl; out_free: kfree(sl); return NULL; } static struct inode *fuse_alloc_inode(struct super_block *sb) { struct fuse_inode *fi; fi = alloc_inode_sb(sb, fuse_inode_cachep, GFP_KERNEL); if (!fi) return NULL; /* Initialize private data (i.e. everything except fi->inode) */ BUILD_BUG_ON(offsetof(struct fuse_inode, inode) != 0); memset((void *) fi + sizeof(fi->inode), 0, sizeof(*fi) - sizeof(fi->inode)); fi->inval_mask = ~0; mutex_init(&fi->mutex); spin_lock_init(&fi->lock); fi->forget = fuse_alloc_forget(); if (!fi->forget) goto out_free; if (IS_ENABLED(CONFIG_FUSE_DAX) && !fuse_dax_inode_alloc(sb, fi)) goto out_free_forget; if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) fuse_inode_backing_set(fi, NULL); return &fi->inode; out_free_forget: kfree(fi->forget); out_free: kmem_cache_free(fuse_inode_cachep, fi); return NULL; } static void fuse_free_inode(struct inode *inode) { struct fuse_inode *fi = get_fuse_inode(inode); mutex_destroy(&fi->mutex); kfree(fi->forget); #ifdef CONFIG_FUSE_DAX kfree(fi->dax); #endif if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) fuse_backing_put(fuse_inode_backing(fi)); kmem_cache_free(fuse_inode_cachep, fi); } static void fuse_cleanup_submount_lookup(struct fuse_conn *fc, struct fuse_submount_lookup *sl) { if (!refcount_dec_and_test(&sl->count)) return; fuse_queue_forget(fc, sl->forget, sl->nodeid, 1); sl->forget = NULL; kfree(sl); } static void fuse_evict_inode(struct inode *inode) { struct fuse_inode *fi = get_fuse_inode(inode); /* Will write inode on close/munmap and in all other dirtiers */ WARN_ON(inode->i_state & I_DIRTY_INODE); if (FUSE_IS_DAX(inode)) dax_break_layout_final(inode); truncate_inode_pages_final(&inode->i_data); clear_inode(inode); if (inode->i_sb->s_flags & SB_ACTIVE) { struct fuse_conn *fc = get_fuse_conn(inode); if (FUSE_IS_DAX(inode)) fuse_dax_inode_cleanup(inode); if (fi->nlookup) { fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup); fi->forget = NULL; } if (fi->submount_lookup) { fuse_cleanup_submount_lookup(fc, fi->submount_lookup); fi->submount_lookup = NULL; } /* * Evict of non-deleted inode may race with outstanding * LOOKUP/READDIRPLUS requests and result in inconsistency when * the request finishes. Deal with that here by bumping a * counter that can be compared to the starting value. */ if (inode->i_nlink > 0) atomic64_inc(&fc->evict_ctr); } if (S_ISREG(inode->i_mode) && !fuse_is_bad(inode)) { WARN_ON(fi->iocachectr != 0); WARN_ON(!list_empty(&fi->write_files)); WARN_ON(!list_empty(&fi->queued_writes)); } } static int fuse_reconfigure(struct fs_context *fsc) { struct super_block *sb = fsc->root->d_sb; sync_filesystem(sb); if (fsc->sb_flags & SB_MANDLOCK) return -EINVAL; return 0; } /* * ino_t is 32-bits on 32-bit arch. We have to squash the 64-bit value down * so that it will fit. */ static ino_t fuse_squash_ino(u64 ino64) { ino_t ino = (ino_t) ino64; if (sizeof(ino_t) < sizeof(u64)) ino ^= ino64 >> (sizeof(u64) - sizeof(ino_t)) * 8; return ino; } void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr, struct fuse_statx *sx, u64 attr_valid, u32 cache_mask, u64 evict_ctr) { struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_inode *fi = get_fuse_inode(inode); lockdep_assert_held(&fi->lock); /* * Clear basic stats from invalid mask. * * Don't do this if this is coming from a fuse_iget() call and there * might have been a racing evict which would've invalidated the result * if the attr_version would've been preserved. * * !evict_ctr -> this is create * fi->attr_version != 0 -> this is not a new inode * evict_ctr == fuse_get_evict_ctr() -> no evicts while during request */ if (!evict_ctr || fi->attr_version || evict_ctr == fuse_get_evict_ctr(fc)) set_mask_bits(&fi->inval_mask, STATX_BASIC_STATS, 0); fi->attr_version = atomic64_inc_return(&fc->attr_version); fi->i_time = attr_valid; inode->i_ino = fuse_squash_ino(attr->ino); inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); set_nlink(inode, attr->nlink); inode->i_uid = make_kuid(fc->user_ns, attr->uid); inode->i_gid = make_kgid(fc->user_ns, attr->gid); inode->i_blocks = attr->blocks; /* Sanitize nsecs */ attr->atimensec = min_t(u32, attr->atimensec, NSEC_PER_SEC - 1); attr->mtimensec = min_t(u32, attr->mtimensec, NSEC_PER_SEC - 1); attr->ctimensec = min_t(u32, attr->ctimensec, NSEC_PER_SEC - 1); inode_set_atime(inode, attr->atime, attr->atimensec); /* mtime from server may be stale due to local buffered write */ if (!(cache_mask & STATX_MTIME)) { inode_set_mtime(inode, attr->mtime, attr->mtimensec); } if (!(cache_mask & STATX_CTIME)) { inode_set_ctime(inode, attr->ctime, attr->ctimensec); } if (sx) { /* Sanitize nsecs */ sx->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1); /* * Btime has been queried, cache is valid (whether or not btime * is available or not) so clear STATX_BTIME from inval_mask. * * Availability of the btime attribute is indicated in * FUSE_I_BTIME */ set_mask_bits(&fi->inval_mask, STATX_BTIME, 0); if (sx->mask & STATX_BTIME) { set_bit(FUSE_I_BTIME, &fi->state); fi->i_btime.tv_sec = sx->btime.tv_sec; fi->i_btime.tv_nsec = sx->btime.tv_nsec; } } if (attr->blksize) fi->cached_i_blkbits = ilog2(attr->blksize); else fi->cached_i_blkbits = fc->blkbits; /* * Don't set the sticky bit in i_mode, unless we want the VFS * to check permissions. This prevents failures due to the * check in may_delete(). */ fi->orig_i_mode = inode->i_mode; if (!fc->default_permissions) inode->i_mode &= ~S_ISVTX; fi->orig_ino = attr->ino; /* * We are refreshing inode data and it is possible that another * client set suid/sgid or security.capability xattr. So clear * S_NOSEC. Ideally, we could have cleared it only if suid/sgid * was set or if security.capability xattr was set. But we don't * know if security.capability has been set or not. So clear it * anyway. Its less efficient but should be safe. */ inode->i_flags &= ~S_NOSEC; } u32 fuse_get_cache_mask(struct inode *inode) { struct fuse_conn *fc = get_fuse_conn(inode); if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) return 0; return STATX_MTIME | STATX_CTIME | STATX_SIZE; } static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr, struct fuse_statx *sx, u64 attr_valid, u64 attr_version, u64 evict_ctr) { struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_inode *fi = get_fuse_inode(inode); u32 cache_mask; loff_t oldsize; struct timespec64 old_mtime; spin_lock(&fi->lock); /* * In case of writeback_cache enabled, writes update mtime, ctime and * may update i_size. In these cases trust the cached value in the * inode. */ cache_mask = fuse_get_cache_mask(inode); if (cache_mask & STATX_SIZE) attr->size = i_size_read(inode); if (cache_mask & STATX_MTIME) { attr->mtime = inode_get_mtime_sec(inode); attr->mtimensec = inode_get_mtime_nsec(inode); } if (cache_mask & STATX_CTIME) { attr->ctime = inode_get_ctime_sec(inode); attr->ctimensec = inode_get_ctime_nsec(inode); } if ((attr_version != 0 && fi->attr_version > attr_version) || test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) { spin_unlock(&fi->lock); return; } old_mtime = inode_get_mtime(inode); fuse_change_attributes_common(inode, attr, sx, attr_valid, cache_mask, evict_ctr); oldsize = inode->i_size; /* * In case of writeback_cache enabled, the cached writes beyond EOF * extend local i_size without keeping userspace server in sync. So, * attr->size coming from server can be stale. We cannot trust it. */ if (!(cache_mask & STATX_SIZE)) i_size_write(inode, attr->size); spin_unlock(&fi->lock); if (!cache_mask && S_ISREG(inode->i_mode)) { bool inval = false; if (oldsize != attr->size) { truncate_pagecache(inode, attr->size); if (!fc->explicit_inval_data) inval = true; } else if (fc->auto_inval_data) { struct timespec64 new_mtime = { .tv_sec = attr->mtime, .tv_nsec = attr->mtimensec, }; /* * Auto inval mode also checks and invalidates if mtime * has changed. */ if (!timespec64_equal(&old_mtime, &new_mtime)) inval = true; } if (inval) invalidate_inode_pages2(inode->i_mapping); } if (IS_ENABLED(CONFIG_FUSE_DAX)) fuse_dax_dontcache(inode, attr->flags); } void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, struct fuse_statx *sx, u64 attr_valid, u64 attr_version) { fuse_change_attributes_i(inode, attr, sx, attr_valid, attr_version, 0); } static void fuse_init_submount_lookup(struct fuse_submount_lookup *sl, u64 nodeid) { sl->nodeid = nodeid; refcount_set(&sl->count, 1); } static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr, struct fuse_conn *fc) { inode->i_mode = attr->mode & S_IFMT; inode->i_size = attr->size; inode_set_mtime(inode, attr->mtime, attr->mtimensec); inode_set_ctime(inode, attr->ctime, attr->ctimensec); if (S_ISREG(inode->i_mode)) { fuse_init_common(inode); fuse_init_file_inode(inode, attr->flags); } else if (S_ISDIR(inode->i_mode)) fuse_init_dir(inode); else if (S_ISLNK(inode->i_mode)) fuse_init_symlink(inode); else if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) || S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) { fuse_init_common(inode); init_special_inode(inode, inode->i_mode, new_decode_dev(attr->rdev)); } else BUG(); /* * Ensure that we don't cache acls for daemons without FUSE_POSIX_ACL * so they see the exact same behavior as before. */ if (!fc->posix_acl) inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE; } static int fuse_inode_eq(struct inode *inode, void *_nodeidp) { u64 nodeid = *(u64 *) _nodeidp; if (get_node_id(inode) == nodeid) return 1; else return 0; } static int fuse_inode_set(struct inode *inode, void *_nodeidp) { u64 nodeid = *(u64 *) _nodeidp; get_fuse_inode(inode)->nodeid = nodeid; return 0; } struct inode *fuse_iget(struct super_block *sb, u64 nodeid, int generation, struct fuse_attr *attr, u64 attr_valid, u64 attr_version, u64 evict_ctr) { struct inode *inode; struct fuse_inode *fi; struct fuse_conn *fc = get_fuse_conn_super(sb); /* * Auto mount points get their node id from the submount root, which is * not a unique identifier within this filesystem. * * To avoid conflicts, do not place submount points into the inode hash * table. */ if (fc->auto_submounts && (attr->flags & FUSE_ATTR_SUBMOUNT) && S_ISDIR(attr->mode)) { struct fuse_inode *fi; inode = new_inode(sb); if (!inode) return NULL; fuse_init_inode(inode, attr, fc); fi = get_fuse_inode(inode); fi->nodeid = nodeid; fi->submount_lookup = fuse_alloc_submount_lookup(); if (!fi->submount_lookup) { iput(inode); return NULL; } /* Sets nlookup = 1 on fi->submount_lookup->nlookup */ fuse_init_submount_lookup(fi->submount_lookup, nodeid); inode->i_flags |= S_AUTOMOUNT; goto done; } retry: inode = iget5_locked(sb, nodeid, fuse_inode_eq, fuse_inode_set, &nodeid); if (!inode) return NULL; if ((inode->i_state & I_NEW)) { inode->i_flags |= S_NOATIME; if (!fc->writeback_cache || !S_ISREG(attr->mode)) inode->i_flags |= S_NOCMTIME; inode->i_generation = generation; fuse_init_inode(inode, attr, fc); unlock_new_inode(inode); } else if (fuse_stale_inode(inode, generation, attr)) { /* nodeid was reused, any I/O on the old inode should fail */ fuse_make_bad(inode); if (inode != d_inode(sb->s_root)) { remove_inode_hash(inode); iput(inode); goto retry; } } fi = get_fuse_inode(inode); spin_lock(&fi->lock); fi->nlookup++; spin_unlock(&fi->lock); done: fuse_change_attributes_i(inode, attr, NULL, attr_valid, attr_version, evict_ctr); return inode; } struct inode *fuse_ilookup(struct fuse_conn *fc, u64 nodeid, struct fuse_mount **fm) { struct fuse_mount *fm_iter; struct inode *inode; WARN_ON(!rwsem_is_locked(&fc->killsb)); list_for_each_entry(fm_iter, &fc->mounts, fc_entry) { if (!fm_iter->sb) continue; inode = ilookup5(fm_iter->sb, nodeid, fuse_inode_eq, &nodeid); if (inode) { if (fm) *fm = fm_iter; return inode; } } return NULL; } int fuse_reverse_inval_inode(struct fuse_conn *fc, u64 nodeid, loff_t offset, loff_t len) { struct fuse_inode *fi; struct inode *inode; pgoff_t pg_start; pgoff_t pg_end; inode = fuse_ilookup(fc, nodeid, NULL); if (!inode) return -ENOENT; fi = get_fuse_inode(inode); spin_lock(&fi->lock); fi->attr_version = atomic64_inc_return(&fc->attr_version); spin_unlock(&fi->lock); fuse_invalidate_attr(inode); forget_all_cached_acls(inode); if (offset >= 0) { pg_start = offset >> PAGE_SHIFT; if (len <= 0) pg_end = -1; else pg_end = (offset + len - 1) >> PAGE_SHIFT; invalidate_inode_pages2_range(inode->i_mapping, pg_start, pg_end); } iput(inode); return 0; } void fuse_try_prune_one_inode(struct fuse_conn *fc, u64 nodeid) { struct inode *inode; inode = fuse_ilookup(fc, nodeid, NULL); if (!inode) return; d_prune_aliases(inode); iput(inode); } bool fuse_lock_inode(struct inode *inode) { bool locked = false; if (!get_fuse_conn(inode)->parallel_dirops) { mutex_lock(&get_fuse_inode(inode)->mutex); locked = true; } return locked; } void fuse_unlock_inode(struct inode *inode, bool locked) { if (locked) mutex_unlock(&get_fuse_inode(inode)->mutex); } static void fuse_umount_begin(struct super_block *sb) { struct fuse_conn *fc = get_fuse_conn_super(sb); if (fc->no_force_umount) return; fuse_abort_conn(fc); // Only retire block-device-based superblocks. if (sb->s_bdev != NULL) retire_super(sb); } static void fuse_send_destroy(struct fuse_mount *fm) { if (fm->fc->conn_init) { FUSE_ARGS(args); args.opcode = FUSE_DESTROY; args.force = true; args.nocreds = true; fuse_simple_request(fm, &args); } } static void convert_fuse_statfs(struct kstatfs *stbuf, struct fuse_kstatfs *attr) { stbuf->f_type = FUSE_SUPER_MAGIC; stbuf->f_bsize = attr->bsize; stbuf->f_frsize = attr->frsize; stbuf->f_blocks = attr->blocks; stbuf->f_bfree = attr->bfree; stbuf->f_bavail = attr->bavail; stbuf->f_files = attr->files; stbuf->f_ffree = attr->ffree; stbuf->f_namelen = attr->namelen; /* fsid is left zero */ } static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf) { struct super_block *sb = dentry->d_sb; struct fuse_mount *fm = get_fuse_mount_super(sb); FUSE_ARGS(args); struct fuse_statfs_out outarg; int err; if (!fuse_allow_current_process(fm->fc)) { buf->f_type = FUSE_SUPER_MAGIC; return 0; } memset(&outarg, 0, sizeof(outarg)); args.in_numargs = 0; args.opcode = FUSE_STATFS; args.nodeid = get_node_id(d_inode(dentry)); args.out_numargs = 1; args.out_args[0].size = sizeof(outarg); args.out_args[0].value = &outarg; err = fuse_simple_request(fm, &args); if (!err) convert_fuse_statfs(buf, &outarg.st); return err; } static struct fuse_sync_bucket *fuse_sync_bucket_alloc(void) { struct fuse_sync_bucket *bucket; bucket = kzalloc(sizeof(*bucket), GFP_KERNEL | __GFP_NOFAIL); if (bucket) { init_waitqueue_head(&bucket->waitq); /* Initial active count */ atomic_set(&bucket->count, 1); } return bucket; } static void fuse_sync_fs_writes(struct fuse_conn *fc) { struct fuse_sync_bucket *bucket, *new_bucket; int count; new_bucket = fuse_sync_bucket_alloc(); spin_lock(&fc->lock); bucket = rcu_dereference_protected(fc->curr_bucket, 1); count = atomic_read(&bucket->count); WARN_ON(count < 1); /* No outstanding writes? */ if (count == 1) { spin_unlock(&fc->lock); kfree(new_bucket); return; } /* * Completion of new bucket depends on completion of this bucket, so add * one more count. */ atomic_inc(&new_bucket->count); rcu_assign_pointer(fc->curr_bucket, new_bucket); spin_unlock(&fc->lock); /* * Drop initial active count. At this point if all writes in this and * ancestor buckets complete, the count will go to zero and this task * will be woken up. */ atomic_dec(&bucket->count); wait_event(bucket->waitq, atomic_read(&bucket->count) == 0); /* Drop temp count on descendant bucket */ fuse_sync_bucket_dec(new_bucket); kfree_rcu(bucket, rcu); } static int fuse_sync_fs(struct super_block *sb, int wait) { struct fuse_mount *fm = get_fuse_mount_super(sb); struct fuse_conn *fc = fm->fc; struct fuse_syncfs_in inarg; FUSE_ARGS(args); int err; /* * Userspace cannot handle the wait == 0 case. Avoid a * gratuitous roundtrip. */ if (!wait) return 0; /* The filesystem is being unmounted. Nothing to do. */ if (!sb->s_root) return 0; if (!fc->sync_fs) return 0; fuse_sync_fs_writes(fc); memset(&inarg, 0, sizeof(inarg)); args.in_numargs = 1; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; args.opcode = FUSE_SYNCFS; args.nodeid = get_node_id(sb->s_root->d_inode); args.out_numargs = 0; err = fuse_simple_request(fm, &args); if (err == -ENOSYS) { fc->sync_fs = 0; err = 0; } return err; } enum { OPT_SOURCE, OPT_SUBTYPE, OPT_FD, OPT_ROOTMODE, OPT_USER_ID, OPT_GROUP_ID, OPT_DEFAULT_PERMISSIONS, OPT_ALLOW_OTHER, OPT_MAX_READ, OPT_BLKSIZE, OPT_ERR }; static const struct fs_parameter_spec fuse_fs_parameters[] = { fsparam_string ("source", OPT_SOURCE), fsparam_u32 ("fd", OPT_FD), fsparam_u32oct ("rootmode", OPT_ROOTMODE), fsparam_uid ("user_id", OPT_USER_ID), fsparam_gid ("group_id", OPT_GROUP_ID), fsparam_flag ("default_permissions", OPT_DEFAULT_PERMISSIONS), fsparam_flag ("allow_other", OPT_ALLOW_OTHER), fsparam_u32 ("max_read", OPT_MAX_READ), fsparam_u32 ("blksize", OPT_BLKSIZE), fsparam_string ("subtype", OPT_SUBTYPE), {} }; static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param) { struct fs_parse_result result; struct fuse_fs_context *ctx = fsc->fs_private; int opt; kuid_t kuid; kgid_t kgid; if (fsc->purpose == FS_CONTEXT_FOR_RECONFIGURE) { /* * Ignore options coming from mount(MS_REMOUNT) for backward * compatibility. */ if (fsc->oldapi) return 0; return invalfc(fsc, "No changes allowed in reconfigure"); } opt = fs_parse(fsc, fuse_fs_parameters, param, &result); if (opt < 0) return opt; switch (opt) { case OPT_SOURCE: if (fsc->source) return invalfc(fsc, "Multiple sources specified"); fsc->source = param->string; param->string = NULL; break; case OPT_SUBTYPE: if (ctx->subtype) return invalfc(fsc, "Multiple subtypes specified"); ctx->subtype = param->string; param->string = NULL; return 0; case OPT_FD: ctx->fd = result.uint_32; ctx->fd_present = true; break; case OPT_ROOTMODE: if (!fuse_valid_type(result.uint_32)) return invalfc(fsc, "Invalid rootmode"); ctx->rootmode = result.uint_32; ctx->rootmode_present = true; break; case OPT_USER_ID: kuid = result.uid; /* * The requested uid must be representable in the * filesystem's idmapping. */ if (!kuid_has_mapping(fsc->user_ns, kuid)) return invalfc(fsc, "Invalid user_id"); ctx->user_id = kuid; ctx->user_id_present = true; break; case OPT_GROUP_ID: kgid = result.gid; /* * The requested gid must be representable in the * filesystem's idmapping. */ if (!kgid_has_mapping(fsc->user_ns, kgid)) return invalfc(fsc, "Invalid group_id"); ctx->group_id = kgid; ctx->group_id_present = true; break; case OPT_DEFAULT_PERMISSIONS: ctx->default_permissions = true; break; case OPT_ALLOW_OTHER: ctx->allow_other = true; break; case OPT_MAX_READ: ctx->max_read = result.uint_32; break; case OPT_BLKSIZE: if (!ctx->is_bdev) return invalfc(fsc, "blksize only supported for fuseblk"); ctx->blksize = result.uint_32; break; default: return -EINVAL; } return 0; } static void fuse_free_fsc(struct fs_context *fsc) { struct fuse_fs_context *ctx = fsc->fs_private; if (ctx) { kfree(ctx->subtype); kfree(ctx); } } static int fuse_show_options(struct seq_file *m, struct dentry *root) { struct super_block *sb = root->d_sb; struct fuse_conn *fc = get_fuse_conn_super(sb); if (fc->legacy_opts_show) { seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id)); seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id)); if (fc->default_permissions) seq_puts(m, ",default_permissions"); if (fc->allow_other) seq_puts(m, ",allow_other"); if (fc->max_read != ~0) seq_printf(m, ",max_read=%u", fc->max_read); if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE) seq_printf(m, ",blksize=%lu", sb->s_blocksize); } #ifdef CONFIG_FUSE_DAX if (fc->dax_mode == FUSE_DAX_ALWAYS) seq_puts(m, ",dax=always"); else if (fc->dax_mode == FUSE_DAX_NEVER) seq_puts(m, ",dax=never"); else if (fc->dax_mode == FUSE_DAX_INODE_USER) seq_puts(m, ",dax=inode"); #endif return 0; } static void fuse_iqueue_init(struct fuse_iqueue *fiq, const struct fuse_iqueue_ops *ops, void *priv) { memset(fiq, 0, sizeof(struct fuse_iqueue)); spin_lock_init(&fiq->lock); init_waitqueue_head(&fiq->waitq); INIT_LIST_HEAD(&fiq->pending); INIT_LIST_HEAD(&fiq->interrupts); fiq->forget_list_tail = &fiq->forget_list_head; fiq->connected = 1; fiq->ops = ops; fiq->priv = priv; } void fuse_pqueue_init(struct fuse_pqueue *fpq) { unsigned int i; spin_lock_init(&fpq->lock); for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) INIT_LIST_HEAD(&fpq->processing[i]); INIT_LIST_HEAD(&fpq->io); fpq->connected = 1; } void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm, struct user_namespace *user_ns, const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv) { memset(fc, 0, sizeof(*fc)); spin_lock_init(&fc->lock); spin_lock_init(&fc->bg_lock); init_rwsem(&fc->killsb); refcount_set(&fc->count, 1); atomic_set(&fc->dev_count, 1); atomic_set(&fc->epoch, 1); init_waitqueue_head(&fc->blocked_waitq); fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv); INIT_LIST_HEAD(&fc->bg_queue); INIT_LIST_HEAD(&fc->entry); INIT_LIST_HEAD(&fc->devices); atomic_set(&fc->num_waiting, 0); fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND; fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD; atomic64_set(&fc->khctr, 0); fc->polled_files = RB_ROOT; fc->blocked = 0; fc->initialized = 0; fc->connected = 1; atomic64_set(&fc->attr_version, 1); atomic64_set(&fc->evict_ctr, 1); get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key)); fc->pid_ns = get_pid_ns(task_active_pid_ns(current)); fc->user_ns = get_user_ns(user_ns); fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ; fc->max_pages_limit = fuse_max_pages_limit; fc->name_max = FUSE_NAME_LOW_MAX; fc->timeout.req_timeout = 0; if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) fuse_backing_files_init(fc); INIT_LIST_HEAD(&fc->mounts); list_add(&fm->fc_entry, &fc->mounts); fm->fc = fc; } EXPORT_SYMBOL_GPL(fuse_conn_init); static void delayed_release(struct rcu_head *p) { struct fuse_conn *fc = container_of(p, struct fuse_conn, rcu); fuse_uring_destruct(fc); put_user_ns(fc->user_ns); fc->release(fc); } void fuse_conn_put(struct fuse_conn *fc) { if (refcount_dec_and_test(&fc->count)) { struct fuse_iqueue *fiq = &fc->iq; struct fuse_sync_bucket *bucket; if (IS_ENABLED(CONFIG_FUSE_DAX)) fuse_dax_conn_free(fc); if (fc->timeout.req_timeout) cancel_delayed_work_sync(&fc->timeout.work); if (fiq->ops->release) fiq->ops->release(fiq); put_pid_ns(fc->pid_ns); bucket = rcu_dereference_protected(fc->curr_bucket, 1); if (bucket) { WARN_ON(atomic_read(&bucket->count) != 1); kfree(bucket); } if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) fuse_backing_files_free(fc); call_rcu(&fc->rcu, delayed_release); } } EXPORT_SYMBOL_GPL(fuse_conn_put); struct fuse_conn *fuse_conn_get(struct fuse_conn *fc) { refcount_inc(&fc->count); return fc; } EXPORT_SYMBOL_GPL(fuse_conn_get); static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned int mode) { struct fuse_attr attr; memset(&attr, 0, sizeof(attr)); attr.mode = mode; attr.ino = FUSE_ROOT_ID; attr.nlink = 1; return fuse_iget(sb, FUSE_ROOT_ID, 0, &attr, 0, 0, 0); } struct fuse_inode_handle { u64 nodeid; u32 generation; }; static struct dentry *fuse_get_dentry(struct super_block *sb, struct fuse_inode_handle *handle) { struct fuse_conn *fc = get_fuse_conn_super(sb); struct inode *inode; struct dentry *entry; int err = -ESTALE; if (handle->nodeid == 0) goto out_err; inode = ilookup5(sb, handle->nodeid, fuse_inode_eq, &handle->nodeid); if (!inode) { struct fuse_entry_out outarg; const struct qstr name = QSTR_INIT(".", 1); if (!fc->export_support) goto out_err; err = fuse_lookup_name(sb, handle->nodeid, &name, &outarg, &inode); if (err && err != -ENOENT) goto out_err; if (err || !inode) { err = -ESTALE; goto out_err; } err = -EIO; if (get_node_id(inode) != handle->nodeid) goto out_iput; } err = -ESTALE; if (inode->i_generation != handle->generation) goto out_iput; entry = d_obtain_alias(inode); if (!IS_ERR(entry) && get_node_id(inode) != FUSE_ROOT_ID) fuse_invalidate_entry_cache(entry); return entry; out_iput: iput(inode); out_err: return ERR_PTR(err); } static int fuse_encode_fh(struct inode *inode, u32 *fh, int *max_len, struct inode *parent) { int len = parent ? 6 : 3; u64 nodeid; u32 generation; if (*max_len < len) { *max_len = len; return FILEID_INVALID; } nodeid = get_fuse_inode(inode)->nodeid; generation = inode->i_generation; fh[0] = (u32)(nodeid >> 32); fh[1] = (u32)(nodeid & 0xffffffff); fh[2] = generation; if (parent) { nodeid = get_fuse_inode(parent)->nodeid; generation = parent->i_generation; fh[3] = (u32)(nodeid >> 32); fh[4] = (u32)(nodeid & 0xffffffff); fh[5] = generation; } *max_len = len; return parent ? FILEID_INO64_GEN_PARENT : FILEID_INO64_GEN; } static struct dentry *fuse_fh_to_dentry(struct super_block *sb, struct fid *fid, int fh_len, int fh_type) { struct fuse_inode_handle handle; if ((fh_type != FILEID_INO64_GEN && fh_type != FILEID_INO64_GEN_PARENT) || fh_len < 3) return NULL; handle.nodeid = (u64) fid->raw[0] << 32; handle.nodeid |= (u64) fid->raw[1]; handle.generation = fid->raw[2]; return fuse_get_dentry(sb, &handle); } static struct dentry *fuse_fh_to_parent(struct super_block *sb, struct fid *fid, int fh_len, int fh_type) { struct fuse_inode_handle parent; if (fh_type != FILEID_INO64_GEN_PARENT || fh_len < 6) return NULL; parent.nodeid = (u64) fid->raw[3] << 32; parent.nodeid |= (u64) fid->raw[4]; parent.generation = fid->raw[5]; return fuse_get_dentry(sb, &parent); } static struct dentry *fuse_get_parent(struct dentry *child) { struct inode *child_inode = d_inode(child); struct fuse_conn *fc = get_fuse_conn(child_inode); struct inode *inode; struct dentry *parent; struct fuse_entry_out outarg; int err; if (!fc->export_support) return ERR_PTR(-ESTALE); err = fuse_lookup_name(child_inode->i_sb, get_node_id(child_inode), &dotdot_name, &outarg, &inode); if (err) { if (err == -ENOENT) return ERR_PTR(-ESTALE); return ERR_PTR(err); } parent = d_obtain_alias(inode); if (!IS_ERR(parent) && get_node_id(inode) != FUSE_ROOT_ID) fuse_invalidate_entry_cache(parent); return parent; } /* only for fid encoding; no support for file handle */ static const struct export_operations fuse_export_fid_operations = { .encode_fh = fuse_encode_fh, }; static const struct export_operations fuse_export_operations = { .fh_to_dentry = fuse_fh_to_dentry, .fh_to_parent = fuse_fh_to_parent, .encode_fh = fuse_encode_fh, .get_parent = fuse_get_parent, }; static const struct super_operations fuse_super_operations = { .alloc_inode = fuse_alloc_inode, .free_inode = fuse_free_inode, .evict_inode = fuse_evict_inode, .write_inode = fuse_write_inode, .drop_inode = inode_just_drop, .umount_begin = fuse_umount_begin, .statfs = fuse_statfs, .sync_fs = fuse_sync_fs, .show_options = fuse_show_options, }; static void sanitize_global_limit(unsigned int *limit) { /* * The default maximum number of async requests is calculated to consume * 1/2^13 of the total memory, assuming 392 bytes per request. */ if (*limit == 0) *limit = ((totalram_pages() << PAGE_SHIFT) >> 13) / 392; if (*limit >= 1 << 16) *limit = (1 << 16) - 1; } static int set_global_limit(const char *val, const struct kernel_param *kp) { int rv; rv = param_set_uint(val, kp); if (rv) return rv; sanitize_global_limit((unsigned int *)kp->arg); return 0; } static void process_init_limits(struct fuse_conn *fc, struct fuse_init_out *arg) { int cap_sys_admin = capable(CAP_SYS_ADMIN); if (arg->minor < 13) return; sanitize_global_limit(&max_user_bgreq); sanitize_global_limit(&max_user_congthresh); spin_lock(&fc->bg_lock); if (arg->max_background) { fc->max_background = arg->max_background; if (!cap_sys_admin && fc->max_background > max_user_bgreq) fc->max_background = max_user_bgreq; } if (arg->congestion_threshold) { fc->congestion_threshold = arg->congestion_threshold; if (!cap_sys_admin && fc->congestion_threshold > max_user_congthresh) fc->congestion_threshold = max_user_congthresh; } spin_unlock(&fc->bg_lock); } static void set_request_timeout(struct fuse_conn *fc, unsigned int timeout) { fc->timeout.req_timeout = secs_to_jiffies(timeout); INIT_DELAYED_WORK(&fc->timeout.work, fuse_check_timeout); queue_delayed_work(system_percpu_wq, &fc->timeout.work, fuse_timeout_timer_freq); } static void init_server_timeout(struct fuse_conn *fc, unsigned int timeout) { if (!timeout && !fuse_max_req_timeout && !fuse_default_req_timeout) return; if (!timeout) timeout = fuse_default_req_timeout; if (fuse_max_req_timeout) { if (timeout) timeout = min(fuse_max_req_timeout, timeout); else timeout = fuse_max_req_timeout; } timeout = max(FUSE_TIMEOUT_TIMER_FREQ, timeout); set_request_timeout(fc, timeout); } struct fuse_init_args { struct fuse_args args; struct fuse_init_in in; struct fuse_init_out out; }; static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, int error) { struct fuse_conn *fc = fm->fc; struct fuse_init_args *ia = container_of(args, typeof(*ia), args); struct fuse_init_out *arg = &ia->out; bool ok = true; if (error || arg->major != FUSE_KERNEL_VERSION) ok = false; else { unsigned long ra_pages; unsigned int timeout = 0; process_init_limits(fc, arg); if (arg->minor >= 6) { u64 flags = arg->flags; if (flags & FUSE_INIT_EXT) flags |= (u64) arg->flags2 << 32; ra_pages = arg->max_readahead / PAGE_SIZE; if (flags & FUSE_ASYNC_READ) fc->async_read = 1; if (!(flags & FUSE_POSIX_LOCKS)) fc->no_lock = 1; if (arg->minor >= 17) { if (!(flags & FUSE_FLOCK_LOCKS)) fc->no_flock = 1; } else { if (!(flags & FUSE_POSIX_LOCKS)) fc->no_flock = 1; } if (flags & FUSE_ATOMIC_O_TRUNC) fc->atomic_o_trunc = 1; if (arg->minor >= 9) { /* LOOKUP has dependency on proto version */ if (flags & FUSE_EXPORT_SUPPORT) fc->export_support = 1; } if (flags & FUSE_BIG_WRITES) fc->big_writes = 1; if (flags & FUSE_DONT_MASK) fc->dont_mask = 1; if (flags & FUSE_AUTO_INVAL_DATA) fc->auto_inval_data = 1; else if (flags & FUSE_EXPLICIT_INVAL_DATA) fc->explicit_inval_data = 1; if (flags & FUSE_DO_READDIRPLUS) { fc->do_readdirplus = 1; if (flags & FUSE_READDIRPLUS_AUTO) fc->readdirplus_auto = 1; } if (flags & FUSE_ASYNC_DIO) fc->async_dio = 1; if (flags & FUSE_WRITEBACK_CACHE) fc->writeback_cache = 1; if (flags & FUSE_PARALLEL_DIROPS) fc->parallel_dirops = 1; if (flags & FUSE_HANDLE_KILLPRIV) fc->handle_killpriv = 1; if (arg->time_gran && arg->time_gran <= 1000000000) fm->sb->s_time_gran = arg->time_gran; if ((flags & FUSE_POSIX_ACL)) { fc->default_permissions = 1; fc->posix_acl = 1; } if (flags & FUSE_CACHE_SYMLINKS) fc->cache_symlinks = 1; if (flags & FUSE_ABORT_ERROR) fc->abort_err = 1; if (flags & FUSE_MAX_PAGES) { fc->max_pages = min_t(unsigned int, fc->max_pages_limit, max_t(unsigned int, arg->max_pages, 1)); /* * PATH_MAX file names might need two pages for * ops like rename */ if (fc->max_pages > 1) fc->name_max = FUSE_NAME_MAX; } if (IS_ENABLED(CONFIG_FUSE_DAX)) { if (flags & FUSE_MAP_ALIGNMENT && !fuse_dax_check_alignment(fc, arg->map_alignment)) { ok = false; } if (flags & FUSE_HAS_INODE_DAX) fc->inode_dax = 1; } if (flags & FUSE_HANDLE_KILLPRIV_V2) { fc->handle_killpriv_v2 = 1; fm->sb->s_flags |= SB_NOSEC; } if (flags & FUSE_SETXATTR_EXT) fc->setxattr_ext = 1; if (flags & FUSE_SECURITY_CTX) fc->init_security = 1; if (flags & FUSE_CREATE_SUPP_GROUP) fc->create_supp_group = 1; if (flags & FUSE_DIRECT_IO_ALLOW_MMAP) fc->direct_io_allow_mmap = 1; /* * max_stack_depth is the max stack depth of FUSE fs, * so it has to be at least 1 to support passthrough * to backing files. * * with max_stack_depth > 1, the backing files can be * on a stacked fs (e.g. overlayfs) themselves and with * max_stack_depth == 1, FUSE fs can be stacked as the * underlying fs of a stacked fs (e.g. overlayfs). * * Also don't allow the combination of FUSE_PASSTHROUGH * and FUSE_WRITEBACK_CACHE, current design doesn't handle * them together. */ if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH) && (flags & FUSE_PASSTHROUGH) && arg->max_stack_depth > 0 && arg->max_stack_depth <= FILESYSTEM_MAX_STACK_DEPTH && !(flags & FUSE_WRITEBACK_CACHE)) { fc->passthrough = 1; fc->max_stack_depth = arg->max_stack_depth; fm->sb->s_stack_depth = arg->max_stack_depth; } if (flags & FUSE_NO_EXPORT_SUPPORT) fm->sb->s_export_op = &fuse_export_fid_operations; if (flags & FUSE_ALLOW_IDMAP) { if (fc->default_permissions) fm->sb->s_iflags &= ~SB_I_NOIDMAP; else ok = false; } if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled()) fc->io_uring = 1; if (flags & FUSE_REQUEST_TIMEOUT) timeout = arg->request_timeout; } else { ra_pages = fc->max_read / PAGE_SIZE; fc->no_lock = 1; fc->no_flock = 1; } init_server_timeout(fc, timeout); fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages); fc->minor = arg->minor; fc->max_write = arg->minor < 5 ? 4096 : arg->max_write; fc->max_write = max_t(unsigned, 4096, fc->max_write); fc->conn_init = 1; } kfree(ia); if (!ok) { fc->conn_init = 0; fc->conn_error = 1; } fuse_set_initialized(fc); wake_up_all(&fc->blocked_waitq); } static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm) { struct fuse_init_args *ia; u64 flags; ia = kzalloc(sizeof(*ia), GFP_KERNEL | __GFP_NOFAIL); ia->in.major = FUSE_KERNEL_VERSION; ia->in.minor = FUSE_KERNEL_MINOR_VERSION; ia->in.max_readahead = fm->sb->s_bdi->ra_pages * PAGE_SIZE; flags = FUSE_ASYNC_READ | FUSE_POSIX_LOCKS | FUSE_ATOMIC_O_TRUNC | FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES | FUSE_DONT_MASK | FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ | FUSE_FLOCK_LOCKS | FUSE_HAS_IOCTL_DIR | FUSE_AUTO_INVAL_DATA | FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO | FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT | FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL | FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS | FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA | FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT | FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP | FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP | FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP | FUSE_REQUEST_TIMEOUT; #ifdef CONFIG_FUSE_DAX if (fm->fc->dax) flags |= FUSE_MAP_ALIGNMENT; if (fuse_is_inode_dax_mode(fm->fc->dax_mode)) flags |= FUSE_HAS_INODE_DAX; #endif if (fm->fc->auto_submounts) flags |= FUSE_SUBMOUNTS; if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH)) flags |= FUSE_PASSTHROUGH; /* * This is just an information flag for fuse server. No need to check * the reply - server is either sending IORING_OP_URING_CMD or not. */ if (fuse_uring_enabled()) flags |= FUSE_OVER_IO_URING; ia->in.flags = flags; ia->in.flags2 = flags >> 32; ia->args.opcode = FUSE_INIT; ia->args.in_numargs = 1; ia->args.in_args[0].size = sizeof(ia->in); ia->args.in_args[0].value = &ia->in; ia->args.out_numargs = 1; /* Variable length argument used for backward compatibility with interface version < 7.5. Rest of init_out is zeroed by do_get_request(), so a short reply is not a problem */ ia->args.out_argvar = true; ia->args.out_args[0].size = sizeof(ia->out); ia->args.out_args[0].value = &ia->out; ia->args.force = true; ia->args.nocreds = true; return ia; } int fuse_send_init(struct fuse_mount *fm) { struct fuse_init_args *ia = fuse_new_init(fm); int err; if (fm->fc->sync_init) { err = fuse_simple_request(fm, &ia->args); /* Ignore size of init reply */ if (err > 0) err = 0; } else { ia->args.end = process_init_reply; err = fuse_simple_background(fm, &ia->args, GFP_KERNEL); if (!err) return 0; } process_init_reply(fm, &ia->args, err); if (fm->fc->conn_error) return -ENOTCONN; return 0; } EXPORT_SYMBOL_GPL(fuse_send_init); void fuse_free_conn(struct fuse_conn *fc) { WARN_ON(!list_empty(&fc->devices)); kfree(fc); } EXPORT_SYMBOL_GPL(fuse_free_conn); static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb) { int err; char *suffix = ""; if (sb->s_bdev) { suffix = "-fuseblk"; /* * sb->s_bdi points to blkdev's bdi however we want to redirect * it to our private bdi... */ bdi_put(sb->s_bdi); sb->s_bdi = &noop_backing_dev_info; } err = super_setup_bdi_name(sb, "%u:%u%s", MAJOR(fc->dev), MINOR(fc->dev), suffix); if (err) return err; sb->s_bdi->capabilities |= BDI_CAP_STRICTLIMIT; /* * For a single fuse filesystem use max 1% of dirty + * writeback threshold. * * This gives about 1M of write buffer for memory maps on a * machine with 1G and 10% dirty_ratio, which should be more * than enough. * * Privileged users can raise it by writing to * * /sys/class/bdi/<bdi>/max_ratio */ bdi_set_max_ratio(sb->s_bdi, 1); return 0; } struct fuse_dev *fuse_dev_alloc(void) { struct fuse_dev *fud; struct list_head *pq; fud = kzalloc(sizeof(struct fuse_dev), GFP_KERNEL); if (!fud) return NULL; pq = kcalloc(FUSE_PQ_HASH_SIZE, sizeof(struct list_head), GFP_KERNEL); if (!pq) { kfree(fud); return NULL; } fud->pq.processing = pq; fuse_pqueue_init(&fud->pq); return fud; } EXPORT_SYMBOL_GPL(fuse_dev_alloc); void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc) { fud->fc = fuse_conn_get(fc); spin_lock(&fc->lock); list_add_tail(&fud->entry, &fc->devices); spin_unlock(&fc->lock); } EXPORT_SYMBOL_GPL(fuse_dev_install); struct fuse_dev *fuse_dev_alloc_install(struct fuse_conn *fc) { struct fuse_dev *fud; fud = fuse_dev_alloc(); if (!fud) return NULL; fuse_dev_install(fud, fc); return fud; } EXPORT_SYMBOL_GPL(fuse_dev_alloc_install); void fuse_dev_free(struct fuse_dev *fud) { struct fuse_conn *fc = fud->fc; if (fc) { spin_lock(&fc->lock); list_del(&fud->entry); spin_unlock(&fc->lock); fuse_conn_put(fc); } kfree(fud->pq.processing); kfree(fud); } EXPORT_SYMBOL_GPL(fuse_dev_free); static void fuse_fill_attr_from_inode(struct fuse_attr *attr, const struct fuse_inode *fi) { struct timespec64 atime = inode_get_atime(&fi->inode); struct timespec64 mtime = inode_get_mtime(&fi->inode); struct timespec64 ctime = inode_get_ctime(&fi->inode); *attr = (struct fuse_attr){ .ino = fi->inode.i_ino, .size = fi->inode.i_size, .blocks = fi->inode.i_blocks, .atime = atime.tv_sec, .mtime = mtime.tv_sec, .ctime = ctime.tv_sec, .atimensec = atime.tv_nsec, .mtimensec = mtime.tv_nsec, .ctimensec = ctime.tv_nsec, .mode = fi->inode.i_mode, .nlink = fi->inode.i_nlink, .uid = __kuid_val(fi->inode.i_uid), .gid = __kgid_val(fi->inode.i_gid), .rdev = fi->inode.i_rdev, .blksize = 1u << fi->inode.i_blkbits, }; } static void fuse_sb_defaults(struct super_block *sb) { sb->s_magic = FUSE_SUPER_MAGIC; sb->s_op = &fuse_super_operations; sb->s_xattr = fuse_xattr_handlers; sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_time_gran = 1; sb->s_export_op = &fuse_export_operations; sb->s_iflags |= SB_I_IMA_UNVERIFIABLE_SIGNATURE; sb->s_iflags |= SB_I_NOIDMAP; if (sb->s_user_ns != &init_user_ns) sb->s_iflags |= SB_I_UNTRUSTED_MOUNTER; sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION); } static int fuse_fill_super_submount(struct super_block *sb, struct fuse_inode *parent_fi) { struct fuse_mount *fm = get_fuse_mount_super(sb); struct super_block *parent_sb = parent_fi->inode.i_sb; struct fuse_attr root_attr; struct inode *root; struct fuse_submount_lookup *sl; struct fuse_inode *fi; fuse_sb_defaults(sb); fm->sb = sb; WARN_ON(sb->s_bdi != &noop_backing_dev_info); sb->s_bdi = bdi_get(parent_sb->s_bdi); sb->s_xattr = parent_sb->s_xattr; sb->s_export_op = parent_sb->s_export_op; sb->s_time_gran = parent_sb->s_time_gran; sb->s_blocksize = parent_sb->s_blocksize; sb->s_blocksize_bits = parent_sb->s_blocksize_bits; sb->s_subtype = kstrdup(parent_sb->s_subtype, GFP_KERNEL); if (parent_sb->s_subtype && !sb->s_subtype) return -ENOMEM; fuse_fill_attr_from_inode(&root_attr, parent_fi); root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0, fuse_get_evict_ctr(fm->fc)); /* * This inode is just a duplicate, so it is not looked up and * its nlookup should not be incremented. fuse_iget() does * that, though, so undo it here. */ fi = get_fuse_inode(root); fi->nlookup--; set_default_d_op(sb, &fuse_dentry_operations); sb->s_root = d_make_root(root); if (!sb->s_root) return -ENOMEM; /* * Grab the parent's submount_lookup pointer and take a * reference on the shared nlookup from the parent. This is to * prevent the last forget for this nodeid from getting * triggered until all users have finished with it. */ sl = parent_fi->submount_lookup; WARN_ON(!sl); if (sl) { refcount_inc(&sl->count); fi->submount_lookup = sl; } return 0; } /* Filesystem context private data holds the FUSE inode of the mount point */ static int fuse_get_tree_submount(struct fs_context *fsc) { struct fuse_mount *fm; struct fuse_inode *mp_fi = fsc->fs_private; struct fuse_conn *fc = get_fuse_conn(&mp_fi->inode); struct super_block *sb; int err; fm = kzalloc(sizeof(struct fuse_mount), GFP_KERNEL); if (!fm) return -ENOMEM; fm->fc = fuse_conn_get(fc); fsc->s_fs_info = fm; sb = sget_fc(fsc, NULL, set_anon_super_fc); if (fsc->s_fs_info) fuse_mount_destroy(fm); if (IS_ERR(sb)) return PTR_ERR(sb); /* Initialize superblock, making @mp_fi its root */ err = fuse_fill_super_submount(sb, mp_fi); if (err) { deactivate_locked_super(sb); return err; } down_write(&fc->killsb); list_add_tail(&fm->fc_entry, &fc->mounts); up_write(&fc->killsb); sb->s_flags |= SB_ACTIVE; fsc->root = dget(sb->s_root); return 0; } static const struct fs_context_operations fuse_context_submount_ops = { .get_tree = fuse_get_tree_submount, }; int fuse_init_fs_context_submount(struct fs_context *fsc) { fsc->ops = &fuse_context_submount_ops; return 0; } EXPORT_SYMBOL_GPL(fuse_init_fs_context_submount); int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx) { struct fuse_dev *fud = NULL; struct fuse_mount *fm = get_fuse_mount_super(sb); struct fuse_conn *fc = fm->fc; struct inode *root; struct dentry *root_dentry; int err; err = -EINVAL; if (sb->s_flags & SB_MANDLOCK) goto err; rcu_assign_pointer(fc->curr_bucket, fuse_sync_bucket_alloc()); fuse_sb_defaults(sb); if (ctx->is_bdev) { #ifdef CONFIG_BLOCK err = -EINVAL; if (!sb_set_blocksize(sb, ctx->blksize)) goto err; /* * This is a workaround until fuse hooks into iomap for reads. * Use PAGE_SIZE for the blocksize else if the writeback cache * is enabled, buffered writes go through iomap and a read may * overwrite partially written data if blocksize < PAGE_SIZE */ fc->blkbits = sb->s_blocksize_bits; if (ctx->blksize != PAGE_SIZE && !sb_set_blocksize(sb, PAGE_SIZE)) goto err; #endif fc->sync_fs = 1; } else { sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; fc->blkbits = sb->s_blocksize_bits; } sb->s_subtype = ctx->subtype; ctx->subtype = NULL; if (IS_ENABLED(CONFIG_FUSE_DAX)) { err = fuse_dax_conn_alloc(fc, ctx->dax_mode, ctx->dax_dev); if (err) goto err; } if (ctx->fudptr) { err = -ENOMEM; fud = fuse_dev_alloc_install(fc); if (!fud) goto err_free_dax; } fc->dev = sb->s_dev; fm->sb = sb; err = fuse_bdi_init(fc, sb); if (err) goto err_dev_free; /* Handle umasking inside the fuse code */ if (sb->s_flags & SB_POSIXACL) fc->dont_mask = 1; sb->s_flags |= SB_POSIXACL; fc->default_permissions = ctx->default_permissions; fc->allow_other = ctx->allow_other; fc->user_id = ctx->user_id; fc->group_id = ctx->group_id; fc->legacy_opts_show = ctx->legacy_opts_show; fc->max_read = max_t(unsigned int, 4096, ctx->max_read); fc->destroy = ctx->destroy; fc->no_control = ctx->no_control; fc->no_force_umount = ctx->no_force_umount; err = -ENOMEM; root = fuse_get_root_inode(sb, ctx->rootmode); set_default_d_op(sb, &fuse_dentry_operations); root_dentry = d_make_root(root); if (!root_dentry) goto err_dev_free; mutex_lock(&fuse_mutex); err = -EINVAL; if (ctx->fudptr && *ctx->fudptr) { if (*ctx->fudptr == FUSE_DEV_SYNC_INIT) fc->sync_init = 1; else goto err_unlock; } err = fuse_ctl_add_conn(fc); if (err) goto err_unlock; list_add_tail(&fc->entry, &fuse_conn_list); sb->s_root = root_dentry; if (ctx->fudptr) { *ctx->fudptr = fud; wake_up_all(&fuse_dev_waitq); } mutex_unlock(&fuse_mutex); return 0; err_unlock: mutex_unlock(&fuse_mutex); dput(root_dentry); err_dev_free: if (fud) fuse_dev_free(fud); err_free_dax: if (IS_ENABLED(CONFIG_FUSE_DAX)) fuse_dax_conn_free(fc); err: return err; } EXPORT_SYMBOL_GPL(fuse_fill_super_common); static int fuse_fill_super(struct super_block *sb, struct fs_context *fsc) { struct fuse_fs_context *ctx = fsc->fs_private; struct fuse_mount *fm; int err; if (!ctx->file || !ctx->rootmode_present || !ctx->user_id_present || !ctx->group_id_present) return -EINVAL; /* * Require mount to happen from the same user namespace which * opened /dev/fuse to prevent potential attacks. */ if ((ctx->file->f_op != &fuse_dev_operations) || (ctx->file->f_cred->user_ns != sb->s_user_ns)) return -EINVAL; ctx->fudptr = &ctx->file->private_data; err = fuse_fill_super_common(sb, ctx); if (err) return err; /* file->private_data shall be visible on all CPUs after this */ smp_mb(); fm = get_fuse_mount_super(sb); return fuse_send_init(fm); } /* * This is the path where user supplied an already initialized fuse dev. In * this case never create a new super if the old one is gone. */ static int fuse_set_no_super(struct super_block *sb, struct fs_context *fsc) { return -ENOTCONN; } static int fuse_test_super(struct super_block *sb, struct fs_context *fsc) { return fsc->sget_key == get_fuse_conn_super(sb); } static int fuse_get_tree(struct fs_context *fsc) { struct fuse_fs_context *ctx = fsc->fs_private; struct fuse_dev *fud; struct fuse_conn *fc; struct fuse_mount *fm; struct super_block *sb; int err; fc = kmalloc(sizeof(*fc), GFP_KERNEL); if (!fc) return -ENOMEM; fm = kzalloc(sizeof(*fm), GFP_KERNEL); if (!fm) { kfree(fc); return -ENOMEM; } fuse_conn_init(fc, fm, fsc->user_ns, &fuse_dev_fiq_ops, NULL); fc->release = fuse_free_conn; fsc->s_fs_info = fm; if (ctx->fd_present) ctx->file = fget(ctx->fd); if (IS_ENABLED(CONFIG_BLOCK) && ctx->is_bdev) { err = get_tree_bdev(fsc, fuse_fill_super); goto out; } /* * While block dev mount can be initialized with a dummy device fd * (found by device name), normal fuse mounts can't */ err = -EINVAL; if (!ctx->file) goto out; /* * Allow creating a fuse mount with an already initialized fuse * connection */ fud = __fuse_get_dev(ctx->file); if (ctx->file->f_op == &fuse_dev_operations && fud) { fsc->sget_key = fud->fc; sb = sget_fc(fsc, fuse_test_super, fuse_set_no_super); err = PTR_ERR_OR_ZERO(sb); if (!IS_ERR(sb)) fsc->root = dget(sb->s_root); } else { err = get_tree_nodev(fsc, fuse_fill_super); } out: if (fsc->s_fs_info) fuse_mount_destroy(fm); if (ctx->file) fput(ctx->file); return err; } static const struct fs_context_operations fuse_context_ops = { .free = fuse_free_fsc, .parse_param = fuse_parse_param, .reconfigure = fuse_reconfigure, .get_tree = fuse_get_tree, }; /* * Set up the filesystem mount context. */ static int fuse_init_fs_context(struct fs_context *fsc) { struct fuse_fs_context *ctx; ctx = kzalloc(sizeof(struct fuse_fs_context), GFP_KERNEL); if (!ctx) return -ENOMEM; ctx->max_read = ~0; ctx->blksize = FUSE_DEFAULT_BLKSIZE; ctx->legacy_opts_show = true; #ifdef CONFIG_BLOCK if (fsc->fs_type == &fuseblk_fs_type) { ctx->is_bdev = true; ctx->destroy = true; } #endif fsc->fs_private = ctx; fsc->ops = &fuse_context_ops; return 0; } bool fuse_mount_remove(struct fuse_mount *fm) { struct fuse_conn *fc = fm->fc; bool last = false; down_write(&fc->killsb); list_del_init(&fm->fc_entry); if (list_empty(&fc->mounts)) last = true; up_write(&fc->killsb); return last; } EXPORT_SYMBOL_GPL(fuse_mount_remove); void fuse_conn_destroy(struct fuse_mount *fm) { struct fuse_conn *fc = fm->fc; if (fc->destroy) fuse_send_destroy(fm); fuse_abort_conn(fc); fuse_wait_aborted(fc); if (!list_empty(&fc->entry)) { mutex_lock(&fuse_mutex); list_del(&fc->entry); fuse_ctl_remove_conn(fc); mutex_unlock(&fuse_mutex); } } EXPORT_SYMBOL_GPL(fuse_conn_destroy); static void fuse_sb_destroy(struct super_block *sb) { struct fuse_mount *fm = get_fuse_mount_super(sb); bool last; if (sb->s_root) { last = fuse_mount_remove(fm); if (last) fuse_conn_destroy(fm); } } void fuse_mount_destroy(struct fuse_mount *fm) { fuse_conn_put(fm->fc); kfree_rcu(fm, rcu); } EXPORT_SYMBOL(fuse_mount_destroy); static void fuse_kill_sb_anon(struct super_block *sb) { fuse_sb_destroy(sb); kill_anon_super(sb); fuse_mount_destroy(get_fuse_mount_super(sb)); } static struct file_system_type fuse_fs_type = { .owner = THIS_MODULE, .name = "fuse", .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT | FS_ALLOW_IDMAP, .init_fs_context = fuse_init_fs_context, .parameters = fuse_fs_parameters, .kill_sb = fuse_kill_sb_anon, }; MODULE_ALIAS_FS("fuse"); #ifdef CONFIG_BLOCK static void fuse_kill_sb_blk(struct super_block *sb) { fuse_sb_destroy(sb); kill_block_super(sb); fuse_mount_destroy(get_fuse_mount_super(sb)); } static struct file_system_type fuseblk_fs_type = { .owner = THIS_MODULE, .name = "fuseblk", .init_fs_context = fuse_init_fs_context, .parameters = fuse_fs_parameters, .kill_sb = fuse_kill_sb_blk, .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_ALLOW_IDMAP, }; MODULE_ALIAS_FS("fuseblk"); static inline int register_fuseblk(void) { return register_filesystem(&fuseblk_fs_type); } static inline void unregister_fuseblk(void) { unregister_filesystem(&fuseblk_fs_type); } #else static inline int register_fuseblk(void) { return 0; } static inline void unregister_fuseblk(void) { } #endif static void fuse_inode_init_once(void *foo) { struct inode *inode = foo; inode_init_once(inode); } static int __init fuse_fs_init(void) { int err; fuse_inode_cachep = kmem_cache_create("fuse_inode", sizeof(struct fuse_inode), 0, SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT, fuse_inode_init_once); err = -ENOMEM; if (!fuse_inode_cachep) goto out; err = register_fuseblk(); if (err) goto out2; err = register_filesystem(&fuse_fs_type); if (err) goto out3; err = fuse_sysctl_register(); if (err) goto out4; return 0; out4: unregister_filesystem(&fuse_fs_type); out3: unregister_fuseblk(); out2: kmem_cache_destroy(fuse_inode_cachep); out: return err; } static void fuse_fs_cleanup(void) { fuse_sysctl_unregister(); unregister_filesystem(&fuse_fs_type); unregister_fuseblk(); /* * Make sure all delayed rcu free inodes are flushed before we * destroy cache. */ rcu_barrier(); kmem_cache_destroy(fuse_inode_cachep); } static struct kobject *fuse_kobj; static int fuse_sysfs_init(void) { int err; fuse_kobj = kobject_create_and_add("fuse", fs_kobj); if (!fuse_kobj) { err = -ENOMEM; goto out_err; } err = sysfs_create_mount_point(fuse_kobj, "connections"); if (err) goto out_fuse_unregister; return 0; out_fuse_unregister: kobject_put(fuse_kobj); out_err: return err; } static void fuse_sysfs_cleanup(void) { sysfs_remove_mount_point(fuse_kobj, "connections"); kobject_put(fuse_kobj); } static int __init fuse_init(void) { int res; pr_info("init (API version %i.%i)\n", FUSE_KERNEL_VERSION, FUSE_KERNEL_MINOR_VERSION); INIT_LIST_HEAD(&fuse_conn_list); res = fuse_fs_init(); if (res) goto err; res = fuse_dev_init(); if (res) goto err_fs_cleanup; res = fuse_sysfs_init(); if (res) goto err_dev_cleanup; res = fuse_ctl_init(); if (res) goto err_sysfs_cleanup; sanitize_global_limit(&max_user_bgreq); sanitize_global_limit(&max_user_congthresh); return 0; err_sysfs_cleanup: fuse_sysfs_cleanup(); err_dev_cleanup: fuse_dev_cleanup(); err_fs_cleanup: fuse_fs_cleanup(); err: return res; } static void __exit fuse_exit(void) { pr_debug("exit\n"); fuse_ctl_cleanup(); fuse_sysfs_cleanup(); fuse_fs_cleanup(); fuse_dev_cleanup(); } module_init(fuse_init); module_exit(fuse_exit); |
| 2 2 2 2 4 4 4 4 6 5 6 6 3 3 3 3 3 3 3 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 | // SPDX-License-Identifier: GPL-2.0-only /* * Media device node * * Copyright (C) 2010 Nokia Corporation * * Based on drivers/media/video/v4l2_dev.c code authored by * Mauro Carvalho Chehab <mchehab@kernel.org> (version 2) * Alan Cox, <alan@lxorguk.ukuu.org.uk> (version 1) * * Contacts: Laurent Pinchart <laurent.pinchart@ideasonboard.com> * Sakari Ailus <sakari.ailus@iki.fi> * * -- * * Generic media device node infrastructure to register and unregister * character devices using a dynamic major number and proper reference * counting. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/errno.h> #include <linux/init.h> #include <linux/module.h> #include <linux/kernel.h> #include <linux/kmod.h> #include <linux/slab.h> #include <linux/mm.h> #include <linux/string.h> #include <linux/types.h> #include <linux/uaccess.h> #include <media/media-devnode.h> #include <media/media-device.h> #define MEDIA_NUM_DEVICES 256 #define MEDIA_NAME "media" static dev_t media_dev_t; /* * Active devices */ static DEFINE_MUTEX(media_devnode_lock); static DECLARE_BITMAP(media_devnode_nums, MEDIA_NUM_DEVICES); /* Called when the last user of the media device exits. */ static void media_devnode_release(struct device *cd) { struct media_devnode *devnode = to_media_devnode(cd); /* Release media_devnode and perform other cleanups as needed. */ if (devnode->release) devnode->release(devnode); kfree(devnode); pr_debug("%s: Media Devnode Deallocated\n", __func__); } static const struct bus_type media_bus_type = { .name = MEDIA_NAME, }; static ssize_t media_read(struct file *filp, char __user *buf, size_t sz, loff_t *off) { struct media_devnode *devnode = media_devnode_data(filp); if (!devnode->fops->read) return -EINVAL; if (!media_devnode_is_registered(devnode)) return -EIO; return devnode->fops->read(filp, buf, sz, off); } static ssize_t media_write(struct file *filp, const char __user *buf, size_t sz, loff_t *off) { struct media_devnode *devnode = media_devnode_data(filp); if (!devnode->fops->write) return -EINVAL; if (!media_devnode_is_registered(devnode)) return -EIO; return devnode->fops->write(filp, buf, sz, off); } static __poll_t media_poll(struct file *filp, struct poll_table_struct *poll) { struct media_devnode *devnode = media_devnode_data(filp); if (!media_devnode_is_registered(devnode)) return EPOLLERR | EPOLLHUP; if (!devnode->fops->poll) return DEFAULT_POLLMASK; return devnode->fops->poll(filp, poll); } static long __media_ioctl(struct file *filp, unsigned int cmd, unsigned long arg, long (*ioctl_func)(struct file *filp, unsigned int cmd, unsigned long arg)) { struct media_devnode *devnode = media_devnode_data(filp); if (!ioctl_func) return -ENOTTY; if (!media_devnode_is_registered(devnode)) return -EIO; return ioctl_func(filp, cmd, arg); } static long media_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct media_devnode *devnode = media_devnode_data(filp); return __media_ioctl(filp, cmd, arg, devnode->fops->ioctl); } #ifdef CONFIG_COMPAT static long media_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct media_devnode *devnode = media_devnode_data(filp); return __media_ioctl(filp, cmd, arg, devnode->fops->compat_ioctl); } #endif /* CONFIG_COMPAT */ /* Override for the open function */ static int media_open(struct inode *inode, struct file *filp) { struct media_devnode *devnode; int ret; /* Check if the media device is available. This needs to be done with * the media_devnode_lock held to prevent an open/unregister race: * without the lock, the device could be unregistered and freed between * the media_devnode_is_registered() and get_device() calls, leading to * a crash. */ mutex_lock(&media_devnode_lock); devnode = container_of(inode->i_cdev, struct media_devnode, cdev); /* return ENXIO if the media device has been removed already or if it is not registered anymore. */ if (!media_devnode_is_registered(devnode)) { mutex_unlock(&media_devnode_lock); return -ENXIO; } /* and increase the device refcount */ get_device(&devnode->dev); mutex_unlock(&media_devnode_lock); filp->private_data = devnode; if (devnode->fops->open) { ret = devnode->fops->open(filp); if (ret) { put_device(&devnode->dev); filp->private_data = NULL; return ret; } } return 0; } /* Override for the release function */ static int media_release(struct inode *inode, struct file *filp) { struct media_devnode *devnode = media_devnode_data(filp); if (devnode->fops->release) devnode->fops->release(filp); filp->private_data = NULL; /* decrease the refcount unconditionally since the release() return value is ignored. */ put_device(&devnode->dev); return 0; } static const struct file_operations media_devnode_fops = { .owner = THIS_MODULE, .read = media_read, .write = media_write, .open = media_open, .unlocked_ioctl = media_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = media_compat_ioctl, #endif /* CONFIG_COMPAT */ .release = media_release, .poll = media_poll, }; int __must_check media_devnode_register(struct media_device *mdev, struct media_devnode *devnode, struct module *owner) { int minor; int ret; /* Part 1: Find a free minor number */ mutex_lock(&media_devnode_lock); minor = find_first_zero_bit(media_devnode_nums, MEDIA_NUM_DEVICES); if (minor == MEDIA_NUM_DEVICES) { mutex_unlock(&media_devnode_lock); pr_err("could not get a free minor\n"); kfree(devnode); return -ENFILE; } set_bit(minor, media_devnode_nums); mutex_unlock(&media_devnode_lock); devnode->minor = minor; devnode->media_dev = mdev; /* Part 1: Initialize dev now to use dev.kobj for cdev.kobj.parent */ devnode->dev.bus = &media_bus_type; devnode->dev.devt = MKDEV(MAJOR(media_dev_t), devnode->minor); devnode->dev.release = media_devnode_release; if (devnode->parent) devnode->dev.parent = devnode->parent; dev_set_name(&devnode->dev, "media%d", devnode->minor); device_initialize(&devnode->dev); /* Part 2: Initialize the character device */ cdev_init(&devnode->cdev, &media_devnode_fops); devnode->cdev.owner = owner; kobject_set_name(&devnode->cdev.kobj, "media%d", devnode->minor); /* Part 3: Add the media and char device */ set_bit(MEDIA_FLAG_REGISTERED, &devnode->flags); ret = cdev_device_add(&devnode->cdev, &devnode->dev); if (ret < 0) { clear_bit(MEDIA_FLAG_REGISTERED, &devnode->flags); pr_err("%s: cdev_device_add failed\n", __func__); goto cdev_add_error; } return 0; cdev_add_error: mutex_lock(&media_devnode_lock); clear_bit(devnode->minor, media_devnode_nums); devnode->media_dev = NULL; mutex_unlock(&media_devnode_lock); put_device(&devnode->dev); return ret; } void media_devnode_unregister_prepare(struct media_devnode *devnode) { /* Check if devnode was ever registered at all */ if (!media_devnode_is_registered(devnode)) return; mutex_lock(&media_devnode_lock); clear_bit(MEDIA_FLAG_REGISTERED, &devnode->flags); mutex_unlock(&media_devnode_lock); } void media_devnode_unregister(struct media_devnode *devnode) { mutex_lock(&media_devnode_lock); /* Delete the cdev on this minor as well */ cdev_device_del(&devnode->cdev, &devnode->dev); devnode->media_dev = NULL; clear_bit(devnode->minor, media_devnode_nums); mutex_unlock(&media_devnode_lock); put_device(&devnode->dev); } /* * Initialise media for linux */ static int __init media_devnode_init(void) { int ret; pr_info("Linux media interface: v0.10\n"); ret = alloc_chrdev_region(&media_dev_t, 0, MEDIA_NUM_DEVICES, MEDIA_NAME); if (ret < 0) { pr_warn("unable to allocate major\n"); return ret; } ret = bus_register(&media_bus_type); if (ret < 0) { unregister_chrdev_region(media_dev_t, MEDIA_NUM_DEVICES); pr_warn("bus_register failed\n"); return -EIO; } return 0; } static void __exit media_devnode_exit(void) { bus_unregister(&media_bus_type); unregister_chrdev_region(media_dev_t, MEDIA_NUM_DEVICES); } subsys_initcall(media_devnode_init); module_exit(media_devnode_exit) MODULE_AUTHOR("Laurent Pinchart <laurent.pinchart@ideasonboard.com>"); MODULE_DESCRIPTION("Device node registration for media drivers"); MODULE_LICENSE("GPL"); |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 | /* SPDX-License-Identifier: GPL-2.0 */ /* Copyright (C) B.A.T.M.A.N. contributors: * * Marek Lindner, Simon Wunderlich */ #ifndef _NET_BATMAN_ADV_ORIGINATOR_H_ #define _NET_BATMAN_ADV_ORIGINATOR_H_ #include "main.h" #include <linux/compiler.h> #include <linux/if_ether.h> #include <linux/jhash.h> #include <linux/kref.h> #include <linux/netlink.h> #include <linux/skbuff.h> #include <linux/types.h> bool batadv_compare_orig(const struct hlist_node *node, const void *data2); int batadv_originator_init(struct batadv_priv *bat_priv); void batadv_originator_free(struct batadv_priv *bat_priv); void batadv_purge_orig_ref(struct batadv_priv *bat_priv); void batadv_orig_node_release(struct kref *ref); struct batadv_orig_node *batadv_orig_node_new(struct batadv_priv *bat_priv, const u8 *addr); struct batadv_hardif_neigh_node * batadv_hardif_neigh_get(const struct batadv_hard_iface *hard_iface, const u8 *neigh_addr); void batadv_hardif_neigh_release(struct kref *ref); struct batadv_neigh_node * batadv_neigh_node_get_or_create(struct batadv_orig_node *orig_node, struct batadv_hard_iface *hard_iface, const u8 *neigh_addr); void batadv_neigh_node_release(struct kref *ref); struct batadv_neigh_node * batadv_orig_router_get(struct batadv_orig_node *orig_node, const struct batadv_hard_iface *if_outgoing); struct batadv_neigh_node * batadv_orig_to_router(struct batadv_priv *bat_priv, u8 *orig_addr, struct batadv_hard_iface *if_outgoing); struct batadv_neigh_ifinfo * batadv_neigh_ifinfo_new(struct batadv_neigh_node *neigh, struct batadv_hard_iface *if_outgoing); struct batadv_neigh_ifinfo * batadv_neigh_ifinfo_get(struct batadv_neigh_node *neigh, struct batadv_hard_iface *if_outgoing); void batadv_neigh_ifinfo_release(struct kref *ref); int batadv_hardif_neigh_dump(struct sk_buff *msg, struct netlink_callback *cb); struct batadv_orig_ifinfo * batadv_orig_ifinfo_get(struct batadv_orig_node *orig_node, struct batadv_hard_iface *if_outgoing); struct batadv_orig_ifinfo * batadv_orig_ifinfo_new(struct batadv_orig_node *orig_node, struct batadv_hard_iface *if_outgoing); void batadv_orig_ifinfo_release(struct kref *ref); int batadv_orig_dump(struct sk_buff *msg, struct netlink_callback *cb); struct batadv_orig_node_vlan * batadv_orig_node_vlan_new(struct batadv_orig_node *orig_node, unsigned short vid); struct batadv_orig_node_vlan * batadv_orig_node_vlan_get(struct batadv_orig_node *orig_node, unsigned short vid); void batadv_orig_node_vlan_release(struct kref *ref); /** * batadv_choose_orig() - Return the index of the orig entry in the hash table * @data: mac address of the originator node * @size: the size of the hash table * * Return: the hash index where the object represented by @data should be * stored at. */ static inline u32 batadv_choose_orig(const void *data, u32 size) { u32 hash = 0; hash = jhash(data, ETH_ALEN, hash); return hash % size; } struct batadv_orig_node * batadv_orig_hash_find(struct batadv_priv *bat_priv, const void *data); /** * batadv_orig_node_vlan_put() - decrement the refcounter and possibly release * the originator-vlan object * @orig_vlan: the originator-vlan object to release */ static inline void batadv_orig_node_vlan_put(struct batadv_orig_node_vlan *orig_vlan) { if (!orig_vlan) return; kref_put(&orig_vlan->refcount, batadv_orig_node_vlan_release); } /** * batadv_neigh_ifinfo_put() - decrement the refcounter and possibly release * the neigh_ifinfo * @neigh_ifinfo: the neigh_ifinfo object to release */ static inline void batadv_neigh_ifinfo_put(struct batadv_neigh_ifinfo *neigh_ifinfo) { if (!neigh_ifinfo) return; kref_put(&neigh_ifinfo->refcount, batadv_neigh_ifinfo_release); } /** * batadv_hardif_neigh_put() - decrement the hardif neighbors refcounter * and possibly release it * @hardif_neigh: hardif neigh neighbor to free */ static inline void batadv_hardif_neigh_put(struct batadv_hardif_neigh_node *hardif_neigh) { if (!hardif_neigh) return; kref_put(&hardif_neigh->refcount, batadv_hardif_neigh_release); } /** * batadv_neigh_node_put() - decrement the neighbors refcounter and possibly * release it * @neigh_node: neigh neighbor to free */ static inline void batadv_neigh_node_put(struct batadv_neigh_node *neigh_node) { if (!neigh_node) return; kref_put(&neigh_node->refcount, batadv_neigh_node_release); } /** * batadv_orig_ifinfo_put() - decrement the refcounter and possibly release * the orig_ifinfo * @orig_ifinfo: the orig_ifinfo object to release */ static inline void batadv_orig_ifinfo_put(struct batadv_orig_ifinfo *orig_ifinfo) { if (!orig_ifinfo) return; kref_put(&orig_ifinfo->refcount, batadv_orig_ifinfo_release); } /** * batadv_orig_node_put() - decrement the orig node refcounter and possibly * release it * @orig_node: the orig node to free */ static inline void batadv_orig_node_put(struct batadv_orig_node *orig_node) { if (!orig_node) return; kref_put(&orig_node->refcount, batadv_orig_node_release); } #endif /* _NET_BATMAN_ADV_ORIGINATOR_H_ */ |
| 142 162 13 13 13 11 11 10 10 147 149 149 140 141 69 69 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (C) 2016 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/cache.h> #include <linux/random.h> #include <linux/hrtimer.h> #include <linux/ktime.h> #include <linux/string.h> #include <linux/net.h> #include <linux/siphash.h> #include <net/secure_seq.h> #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET) #include <linux/in6.h> #include <net/tcp.h> static siphash_aligned_key_t net_secret; static siphash_aligned_key_t ts_secret; #define EPHEMERAL_PORT_SHUFFLE_PERIOD (10 * HZ) static __always_inline void net_secret_init(void) { net_get_random_once(&net_secret, sizeof(net_secret)); } static __always_inline void ts_secret_init(void) { net_get_random_once(&ts_secret, sizeof(ts_secret)); } #endif #ifdef CONFIG_INET static u32 seq_scale(u32 seq) { /* * As close as possible to RFC 793, which * suggests using a 250 kHz clock. * Further reading shows this assumes 2 Mb/s networks. * For 10 Mb/s Ethernet, a 1 MHz clock is appropriate. * For 10 Gb/s Ethernet, a 1 GHz clock should be ok, but * we also need to limit the resolution so that the u32 seq * overlaps less than one time per MSL (2 minutes). * Choosing a clock of 64 ns period is OK. (period of 274 s) */ return seq + (ktime_get_real_ns() >> 6); } #endif #if IS_ENABLED(CONFIG_IPV6) u32 secure_tcpv6_ts_off(const struct net *net, const __be32 *saddr, const __be32 *daddr) { const struct { struct in6_addr saddr; struct in6_addr daddr; } __aligned(SIPHASH_ALIGNMENT) combined = { .saddr = *(struct in6_addr *)saddr, .daddr = *(struct in6_addr *)daddr, }; if (READ_ONCE(net->ipv4.sysctl_tcp_timestamps) != 1) return 0; ts_secret_init(); return siphash(&combined, offsetofend(typeof(combined), daddr), &ts_secret); } EXPORT_IPV6_MOD(secure_tcpv6_ts_off); u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr, __be16 sport, __be16 dport) { const struct { struct in6_addr saddr; struct in6_addr daddr; __be16 sport; __be16 dport; } __aligned(SIPHASH_ALIGNMENT) combined = { .saddr = *(struct in6_addr *)saddr, .daddr = *(struct in6_addr *)daddr, .sport = sport, .dport = dport }; u32 hash; net_secret_init(); hash = siphash(&combined, offsetofend(typeof(combined), dport), &net_secret); return seq_scale(hash); } EXPORT_SYMBOL(secure_tcpv6_seq); u64 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, __be16 dport) { const struct { struct in6_addr saddr; struct in6_addr daddr; unsigned int timeseed; __be16 dport; } __aligned(SIPHASH_ALIGNMENT) combined = { .saddr = *(struct in6_addr *)saddr, .daddr = *(struct in6_addr *)daddr, .timeseed = jiffies / EPHEMERAL_PORT_SHUFFLE_PERIOD, .dport = dport, }; net_secret_init(); return siphash(&combined, offsetofend(typeof(combined), dport), &net_secret); } EXPORT_SYMBOL(secure_ipv6_port_ephemeral); #endif #ifdef CONFIG_INET u32 secure_tcp_ts_off(const struct net *net, __be32 saddr, __be32 daddr) { if (READ_ONCE(net->ipv4.sysctl_tcp_timestamps) != 1) return 0; ts_secret_init(); return siphash_2u32((__force u32)saddr, (__force u32)daddr, &ts_secret); } /* secure_tcp_seq_and_tsoff(a, b, 0, d) == secure_ipv4_port_ephemeral(a, b, d), * but fortunately, `sport' cannot be 0 in any circumstances. If this changes, * it would be easy enough to have the former function use siphash_4u32, passing * the arguments as separate u32. */ u32 secure_tcp_seq(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport) { u32 hash; net_secret_init(); hash = siphash_3u32((__force u32)saddr, (__force u32)daddr, (__force u32)sport << 16 | (__force u32)dport, &net_secret); return seq_scale(hash); } EXPORT_SYMBOL_GPL(secure_tcp_seq); u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport) { net_secret_init(); return siphash_4u32((__force u32)saddr, (__force u32)daddr, (__force u16)dport, jiffies / EPHEMERAL_PORT_SHUFFLE_PERIOD, &net_secret); } EXPORT_SYMBOL_GPL(secure_ipv4_port_ephemeral); #endif |
| 92 75 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* SCTP kernel implementation * (C) Copyright IBM Corp. 2001, 2004 * Copyright (c) 1999-2000 Cisco, Inc. * Copyright (c) 1999-2001 Motorola, Inc. * Copyright (c) 2001 Intel Corp. * * This file is part of the SCTP kernel implementation * * These are the definitions needed for the tsnmap type. The tsnmap is used * to track out of order TSNs received. * * Please send any bug reports or fixes you make to the * email address(es): * lksctp developers <linux-sctp@vger.kernel.org> * * Written or modified by: * Jon Grimm <jgrimm@us.ibm.com> * La Monte H.P. Yarroll <piggy@acm.org> * Karl Knutson <karl@athena.chicago.il.us> * Sridhar Samudrala <sri@us.ibm.com> */ #include <net/sctp/constants.h> #ifndef __sctp_tsnmap_h__ #define __sctp_tsnmap_h__ /* RFC 2960 12.2 Parameters necessary per association (i.e. the TCB) * Mapping An array of bits or bytes indicating which out of * Array order TSN's have been received (relative to the * Last Rcvd TSN). If no gaps exist, i.e. no out of * order packets have been received, this array * will be set to all zero. This structure may be * in the form of a circular buffer or bit array. */ struct sctp_tsnmap { /* This array counts the number of chunks with each TSN. * It points at one of the two buffers with which we will * ping-pong between. */ unsigned long *tsn_map; /* This is the TSN at tsn_map[0]. */ __u32 base_tsn; /* Last Rcvd : This is the last TSN received in * TSN : sequence. This value is set initially by * : taking the peer's Initial TSN, received in * : the INIT or INIT ACK chunk, and subtracting * : one from it. * * Throughout most of the specification this is called the * "Cumulative TSN ACK Point". In this case, we * ignore the advice in 12.2 in favour of the term * used in the bulk of the text. */ __u32 cumulative_tsn_ack_point; /* This is the highest TSN we've marked. */ __u32 max_tsn_seen; /* This is the minimum number of TSNs we can track. This corresponds * to the size of tsn_map. Note: the overflow_map allows us to * potentially track more than this quantity. */ __u16 len; /* Data chunks pending receipt. used by SCTP_STATUS sockopt */ __u16 pending_data; /* Record duplicate TSNs here. We clear this after * every SACK. Store up to SCTP_MAX_DUP_TSNS worth of * information. */ __u16 num_dup_tsns; __be32 dup_tsns[SCTP_MAX_DUP_TSNS]; }; struct sctp_tsnmap_iter { __u32 start; }; /* Initialize a block of memory as a tsnmap. */ struct sctp_tsnmap *sctp_tsnmap_init(struct sctp_tsnmap *, __u16 len, __u32 initial_tsn, gfp_t gfp); void sctp_tsnmap_free(struct sctp_tsnmap *map); /* Test the tracking state of this TSN. * Returns: * 0 if the TSN has not yet been seen * >0 if the TSN has been seen (duplicate) * <0 if the TSN is invalid (too large to track) */ int sctp_tsnmap_check(const struct sctp_tsnmap *, __u32 tsn); /* Mark this TSN as seen. */ int sctp_tsnmap_mark(struct sctp_tsnmap *, __u32 tsn, struct sctp_transport *trans); /* Mark this TSN and all lower as seen. */ void sctp_tsnmap_skip(struct sctp_tsnmap *map, __u32 tsn); /* Retrieve the Cumulative TSN ACK Point. */ static inline __u32 sctp_tsnmap_get_ctsn(const struct sctp_tsnmap *map) { return map->cumulative_tsn_ack_point; } /* Retrieve the highest TSN we've seen. */ static inline __u32 sctp_tsnmap_get_max_tsn_seen(const struct sctp_tsnmap *map) { return map->max_tsn_seen; } /* How many duplicate TSNs are stored? */ static inline __u16 sctp_tsnmap_num_dups(struct sctp_tsnmap *map) { return map->num_dup_tsns; } /* Return pointer to duplicate tsn array as needed by SACK. */ static inline __be32 *sctp_tsnmap_get_dups(struct sctp_tsnmap *map) { map->num_dup_tsns = 0; return map->dup_tsns; } /* How many gap ack blocks do we have recorded? */ __u16 sctp_tsnmap_num_gabs(struct sctp_tsnmap *map, struct sctp_gap_ack_block *gabs); /* Refresh the count on pending data. */ __u16 sctp_tsnmap_pending(struct sctp_tsnmap *map); /* Is there a gap in the TSN map? */ static inline int sctp_tsnmap_has_gap(const struct sctp_tsnmap *map) { return map->cumulative_tsn_ack_point != map->max_tsn_seen; } /* Mark a duplicate TSN. Note: limit the storage of duplicate TSN * information. */ static inline void sctp_tsnmap_mark_dup(struct sctp_tsnmap *map, __u32 tsn) { if (map->num_dup_tsns < SCTP_MAX_DUP_TSNS) map->dup_tsns[map->num_dup_tsns++] = htonl(tsn); } /* Renege a TSN that was seen. */ void sctp_tsnmap_renege(struct sctp_tsnmap *, __u32 tsn); /* Is there a gap in the TSN map? */ int sctp_tsnmap_has_gap(const struct sctp_tsnmap *); #endif /* __sctp_tsnmap_h__ */ |
| 1 1 15 15 13 12 12 11 11 10 4 3 2 5 6 5 4 5 5 2 7 4 11 12 15 61 61 60 60 60 60 2 60 60 60 59 55 55 59 59 59 1 60 61 105 6 6 6 6 1 6 1 6 5 2 2 2 6 6 2 6 6 104 105 7 2 9 9 6 6 6 6 1 6 6 4 4 4 4 4 4 1 4 4 3 3 3 3 3 4 4 4 18 6 6 6 6 18 16 132 132 132 132 131 131 18 16 15 3 12 12 12 3 1 12 10 3 125 119 128 127 128 11 3 11 11 128 128 127 113 106 7 128 120 9 128 119 128 125 18 108 128 18 127 1 128 128 128 119 132 109 3 108 105 108 120 111 128 115 21 18 18 1 10 5 4 3 2 2 2 7 7 2 2 4 3 1 2 85 3 2 77 86 10 5 4 3 2 1 5 49 6 5 38 48 5 9 1 5 4 1 4 4 56 7 56 56 424 404 1 425 5 5 5 3 5 402 402 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 | // SPDX-License-Identifier: GPL-2.0-or-later /* * RAW sockets for IPv6 * Linux INET6 implementation * * Authors: * Pedro Roque <roque@di.fc.ul.pt> * * Adapted from linux/net/ipv4/raw.c * * Fixes: * Hideaki YOSHIFUJI : sin6_scope_id support * YOSHIFUJI,H.@USAGI : raw checksum (RFC2292(bis) compliance) * Kazunori MIYAZAWA @USAGI: change process style to use ip6_append_data */ #include <linux/errno.h> #include <linux/types.h> #include <linux/socket.h> #include <linux/slab.h> #include <linux/sockios.h> #include <linux/net.h> #include <linux/in6.h> #include <linux/netdevice.h> #include <linux/if_arp.h> #include <linux/icmpv6.h> #include <linux/netfilter.h> #include <linux/netfilter_ipv6.h> #include <linux/skbuff.h> #include <linux/compat.h> #include <linux/uaccess.h> #include <asm/ioctls.h> #include <net/net_namespace.h> #include <net/ip.h> #include <net/sock.h> #include <net/snmp.h> #include <net/ipv6.h> #include <net/ndisc.h> #include <net/protocol.h> #include <net/ip6_route.h> #include <net/ip6_checksum.h> #include <net/addrconf.h> #include <net/transp_v6.h> #include <net/udp.h> #include <net/inet_common.h> #include <net/tcp_states.h> #if IS_ENABLED(CONFIG_IPV6_MIP6) #include <net/mip6.h> #endif #include <linux/mroute6.h> #include <net/raw.h> #include <net/rawv6.h> #include <net/xfrm.h> #include <linux/proc_fs.h> #include <linux/seq_file.h> #include <linux/export.h> #define ICMPV6_HDRLEN 4 /* ICMPv6 header, RFC 4443 Section 2.1 */ struct raw_hashinfo raw_v6_hashinfo; EXPORT_SYMBOL_GPL(raw_v6_hashinfo); bool raw_v6_match(struct net *net, const struct sock *sk, unsigned short num, const struct in6_addr *loc_addr, const struct in6_addr *rmt_addr, int dif, int sdif) { if (inet_sk(sk)->inet_num != num || !net_eq(sock_net(sk), net) || (!ipv6_addr_any(&sk->sk_v6_daddr) && !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr)) || !raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif)) return false; if (ipv6_addr_any(&sk->sk_v6_rcv_saddr) || ipv6_addr_equal(&sk->sk_v6_rcv_saddr, loc_addr) || (ipv6_addr_is_multicast(loc_addr) && inet6_mc_check(sk, loc_addr, rmt_addr))) return true; return false; } EXPORT_SYMBOL_GPL(raw_v6_match); /* * 0 - deliver * 1 - block */ static int icmpv6_filter(const struct sock *sk, const struct sk_buff *skb) { struct icmp6hdr _hdr; const struct icmp6hdr *hdr; /* We require only the four bytes of the ICMPv6 header, not any * additional bytes of message body in "struct icmp6hdr". */ hdr = skb_header_pointer(skb, skb_transport_offset(skb), ICMPV6_HDRLEN, &_hdr); if (hdr) { const __u32 *data = &raw6_sk(sk)->filter.data[0]; unsigned int type = hdr->icmp6_type; return (data[type >> 5] & (1U << (type & 31))) != 0; } return 1; } #if IS_ENABLED(CONFIG_IPV6_MIP6) typedef int mh_filter_t(struct sock *sock, struct sk_buff *skb); static mh_filter_t __rcu *mh_filter __read_mostly; int rawv6_mh_filter_register(mh_filter_t filter) { rcu_assign_pointer(mh_filter, filter); return 0; } EXPORT_SYMBOL(rawv6_mh_filter_register); int rawv6_mh_filter_unregister(mh_filter_t filter) { RCU_INIT_POINTER(mh_filter, NULL); synchronize_rcu(); return 0; } EXPORT_SYMBOL(rawv6_mh_filter_unregister); #endif /* * demultiplex raw sockets. * (should consider queueing the skb in the sock receive_queue * without calling rawv6.c) * * Caller owns SKB so we must make clones. */ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr) { struct net *net = dev_net(skb->dev); const struct in6_addr *saddr; const struct in6_addr *daddr; struct hlist_head *hlist; struct sock *sk; bool delivered = false; __u8 hash; saddr = &ipv6_hdr(skb)->saddr; daddr = saddr + 1; hash = raw_hashfunc(net, nexthdr); hlist = &raw_v6_hashinfo.ht[hash]; rcu_read_lock(); sk_for_each_rcu(sk, hlist) { int filtered; if (!raw_v6_match(net, sk, nexthdr, daddr, saddr, inet6_iif(skb), inet6_sdif(skb))) continue; if (atomic_read(&sk->sk_rmem_alloc) >= READ_ONCE(sk->sk_rcvbuf)) { sk_drops_inc(sk); continue; } delivered = true; switch (nexthdr) { case IPPROTO_ICMPV6: filtered = icmpv6_filter(sk, skb); break; #if IS_ENABLED(CONFIG_IPV6_MIP6) case IPPROTO_MH: { /* XXX: To validate MH only once for each packet, * this is placed here. It should be after checking * xfrm policy, however it doesn't. The checking xfrm * policy is placed in rawv6_rcv() because it is * required for each socket. */ mh_filter_t *filter; filter = rcu_dereference(mh_filter); filtered = filter ? (*filter)(sk, skb) : 0; break; } #endif default: filtered = 0; break; } if (filtered < 0) break; if (filtered == 0) { struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC); /* Not releasing hash table! */ if (clone) rawv6_rcv(sk, clone); } } rcu_read_unlock(); return delivered; } bool raw6_local_deliver(struct sk_buff *skb, int nexthdr) { return ipv6_raw_deliver(skb, nexthdr); } /* This cleans up af_inet6 a bit. -DaveM */ static int rawv6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct inet_sock *inet = inet_sk(sk); struct ipv6_pinfo *np = inet6_sk(sk); struct sockaddr_in6 *addr = (struct sockaddr_in6 *) uaddr; __be32 v4addr = 0; int addr_type; int err; if (addr_len < SIN6_LEN_RFC2133) return -EINVAL; if (addr->sin6_family != AF_INET6) return -EINVAL; addr_type = ipv6_addr_type(&addr->sin6_addr); /* Raw sockets are IPv6 only */ if (addr_type == IPV6_ADDR_MAPPED) return -EADDRNOTAVAIL; lock_sock(sk); err = -EINVAL; if (sk->sk_state != TCP_CLOSE) goto out; rcu_read_lock(); /* Check if the address belongs to the host. */ if (addr_type != IPV6_ADDR_ANY) { struct net_device *dev = NULL; if (__ipv6_addr_needs_scope_id(addr_type)) { if (addr_len >= sizeof(struct sockaddr_in6) && addr->sin6_scope_id) { /* Override any existing binding, if another * one is supplied by user. */ sk->sk_bound_dev_if = addr->sin6_scope_id; } /* Binding to link-local address requires an interface */ if (!sk->sk_bound_dev_if) goto out_unlock; } if (sk->sk_bound_dev_if) { err = -ENODEV; dev = dev_get_by_index_rcu(sock_net(sk), sk->sk_bound_dev_if); if (!dev) goto out_unlock; } /* ipv4 addr of the socket is invalid. Only the * unspecified and mapped address have a v4 equivalent. */ v4addr = LOOPBACK4_IPV6; if (!(addr_type & IPV6_ADDR_MULTICAST) && !ipv6_can_nonlocal_bind(sock_net(sk), inet)) { err = -EADDRNOTAVAIL; if (!ipv6_chk_addr(sock_net(sk), &addr->sin6_addr, dev, 0)) { goto out_unlock; } } } inet->inet_rcv_saddr = inet->inet_saddr = v4addr; sk->sk_v6_rcv_saddr = addr->sin6_addr; if (!(addr_type & IPV6_ADDR_MULTICAST)) np->saddr = addr->sin6_addr; err = 0; out_unlock: rcu_read_unlock(); out: release_sock(sk); return err; } static void rawv6_err(struct sock *sk, struct sk_buff *skb, u8 type, u8 code, int offset, __be32 info) { bool recverr = inet6_test_bit(RECVERR6, sk); struct ipv6_pinfo *np = inet6_sk(sk); int err; int harderr; /* Report error on raw socket, if: 1. User requested recverr. 2. Socket is connected (otherwise the error indication is useless without recverr and error is hard. */ if (!recverr && sk->sk_state != TCP_ESTABLISHED) return; harderr = icmpv6_err_convert(type, code, &err); if (type == ICMPV6_PKT_TOOBIG) { ip6_sk_update_pmtu(skb, sk, info); harderr = (READ_ONCE(np->pmtudisc) == IPV6_PMTUDISC_DO); } if (type == NDISC_REDIRECT) { ip6_sk_redirect(skb, sk); return; } if (recverr) { u8 *payload = skb->data; if (!inet_test_bit(HDRINCL, sk)) payload += offset; ipv6_icmp_error(sk, skb, err, 0, ntohl(info), payload); } if (recverr || harderr) { sk->sk_err = err; sk_error_report(sk); } } void raw6_icmp_error(struct sk_buff *skb, int nexthdr, u8 type, u8 code, int inner_offset, __be32 info) { struct net *net = dev_net(skb->dev); struct hlist_head *hlist; struct sock *sk; int hash; hash = raw_hashfunc(net, nexthdr); hlist = &raw_v6_hashinfo.ht[hash]; rcu_read_lock(); sk_for_each_rcu(sk, hlist) { /* Note: ipv6_hdr(skb) != skb->data */ const struct ipv6hdr *ip6h = (const struct ipv6hdr *)skb->data; if (!raw_v6_match(net, sk, nexthdr, &ip6h->saddr, &ip6h->daddr, inet6_iif(skb), inet6_iif(skb))) continue; rawv6_err(sk, skb, type, code, inner_offset, info); } rcu_read_unlock(); } static inline int rawv6_rcv_skb(struct sock *sk, struct sk_buff *skb) { enum skb_drop_reason reason; if ((raw6_sk(sk)->checksum || rcu_access_pointer(sk->sk_filter)) && skb_checksum_complete(skb)) { sk_drops_inc(sk); sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_SKB_CSUM); return NET_RX_DROP; } /* Charge it to the socket. */ skb_dst_drop(skb); if (sock_queue_rcv_skb_reason(sk, skb, &reason) < 0) { sk_skb_reason_drop(sk, skb, reason); return NET_RX_DROP; } return 0; } /* * This is next to useless... * if we demultiplex in network layer we don't need the extra call * just to queue the skb... * maybe we could have the network decide upon a hint if it * should call raw_rcv for demultiplexing */ int rawv6_rcv(struct sock *sk, struct sk_buff *skb) { struct inet_sock *inet = inet_sk(sk); struct raw6_sock *rp = raw6_sk(sk); if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) { sk_drops_inc(sk); sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_XFRM_POLICY); return NET_RX_DROP; } nf_reset_ct(skb); if (!rp->checksum) skb->ip_summed = CHECKSUM_UNNECESSARY; if (skb->ip_summed == CHECKSUM_COMPLETE) { skb_postpull_rcsum(skb, skb_network_header(skb), skb_network_header_len(skb)); if (!csum_ipv6_magic(&ipv6_hdr(skb)->saddr, &ipv6_hdr(skb)->daddr, skb->len, inet->inet_num, skb->csum)) skb->ip_summed = CHECKSUM_UNNECESSARY; } if (!skb_csum_unnecessary(skb)) skb->csum = ~csum_unfold(csum_ipv6_magic(&ipv6_hdr(skb)->saddr, &ipv6_hdr(skb)->daddr, skb->len, inet->inet_num, 0)); if (inet_test_bit(HDRINCL, sk)) { if (skb_checksum_complete(skb)) { sk_drops_inc(sk); sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_SKB_CSUM); return NET_RX_DROP; } } rawv6_rcv_skb(sk, skb); return 0; } /* * This should be easy, if there is something there * we return it, otherwise we block. */ static int rawv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags, int *addr_len) { struct ipv6_pinfo *np = inet6_sk(sk); DECLARE_SOCKADDR(struct sockaddr_in6 *, sin6, msg->msg_name); struct sk_buff *skb; size_t copied; int err; if (flags & MSG_OOB) return -EOPNOTSUPP; if (flags & MSG_ERRQUEUE) return ipv6_recv_error(sk, msg, len, addr_len); if (np->rxopt.bits.rxpmtu && READ_ONCE(np->rxpmtu)) return ipv6_recv_rxpmtu(sk, msg, len, addr_len); skb = skb_recv_datagram(sk, flags, &err); if (!skb) goto out; copied = skb->len; if (copied > len) { copied = len; msg->msg_flags |= MSG_TRUNC; } if (skb_csum_unnecessary(skb)) { err = skb_copy_datagram_msg(skb, 0, msg, copied); } else if (msg->msg_flags&MSG_TRUNC) { if (__skb_checksum_complete(skb)) goto csum_copy_err; err = skb_copy_datagram_msg(skb, 0, msg, copied); } else { err = skb_copy_and_csum_datagram_msg(skb, 0, msg); if (err == -EINVAL) goto csum_copy_err; } if (err) goto out_free; /* Copy the address. */ if (sin6) { sin6->sin6_family = AF_INET6; sin6->sin6_port = 0; sin6->sin6_addr = ipv6_hdr(skb)->saddr; sin6->sin6_flowinfo = 0; sin6->sin6_scope_id = ipv6_iface_scope_id(&sin6->sin6_addr, inet6_iif(skb)); *addr_len = sizeof(*sin6); } sock_recv_cmsgs(msg, sk, skb); if (np->rxopt.all) ip6_datagram_recv_ctl(sk, msg, skb); err = copied; if (flags & MSG_TRUNC) err = skb->len; out_free: skb_free_datagram(sk, skb); out: return err; csum_copy_err: skb_kill_datagram(sk, skb, flags); /* Error for blocking case is chosen to masquerade as some normal condition. */ err = (flags&MSG_DONTWAIT) ? -EAGAIN : -EHOSTUNREACH; goto out; } static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6, struct raw6_sock *rp) { struct ipv6_txoptions *opt; struct sk_buff *skb; int err = 0; int offset; int len; int total_len; __wsum tmp_csum; __sum16 csum; if (!rp->checksum) goto send; skb = skb_peek(&sk->sk_write_queue); if (!skb) goto out; offset = rp->offset; total_len = inet_sk(sk)->cork.base.length; opt = inet6_sk(sk)->cork.opt; total_len -= opt ? opt->opt_flen : 0; if (offset >= total_len - 1) { err = -EINVAL; ip6_flush_pending_frames(sk); goto out; } /* should be check HW csum miyazawa */ if (skb_queue_len(&sk->sk_write_queue) == 1) { /* * Only one fragment on the socket. */ tmp_csum = skb->csum; } else { struct sk_buff *csum_skb = NULL; tmp_csum = 0; skb_queue_walk(&sk->sk_write_queue, skb) { tmp_csum = csum_add(tmp_csum, skb->csum); if (csum_skb) continue; len = skb->len - skb_transport_offset(skb); if (offset >= len) { offset -= len; continue; } csum_skb = skb; } skb = csum_skb; } offset += skb_transport_offset(skb); err = skb_copy_bits(skb, offset, &csum, 2); if (err < 0) { ip6_flush_pending_frames(sk); goto out; } /* in case cksum was not initialized */ if (unlikely(csum)) tmp_csum = csum_sub(tmp_csum, csum_unfold(csum)); csum = csum_ipv6_magic(&fl6->saddr, &fl6->daddr, total_len, fl6->flowi6_proto, tmp_csum); if (csum == 0 && fl6->flowi6_proto == IPPROTO_UDP) csum = CSUM_MANGLED_0; BUG_ON(skb_store_bits(skb, offset, &csum, 2)); send: err = ip6_push_pending_frames(sk); out: return err; } static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length, struct flowi6 *fl6, struct dst_entry **dstp, unsigned int flags, const struct sockcm_cookie *sockc) { struct net *net = sock_net(sk); struct ipv6hdr *iph; struct sk_buff *skb; int err; struct rt6_info *rt = dst_rt6_info(*dstp); int hlen = LL_RESERVED_SPACE(rt->dst.dev); int tlen = rt->dst.dev->needed_tailroom; if (length > rt->dst.dev->mtu) { ipv6_local_error(sk, EMSGSIZE, fl6, rt->dst.dev->mtu); return -EMSGSIZE; } if (length < sizeof(struct ipv6hdr)) return -EINVAL; if (flags&MSG_PROBE) goto out; skb = sock_alloc_send_skb(sk, length + hlen + tlen + 15, flags & MSG_DONTWAIT, &err); if (!skb) goto error; skb_reserve(skb, hlen); skb->protocol = htons(ETH_P_IPV6); skb->priority = sockc->priority; skb->mark = sockc->mark; skb_set_delivery_type_by_clockid(skb, sockc->transmit_time, sk->sk_clockid); skb_put(skb, length); skb_reset_network_header(skb); iph = ipv6_hdr(skb); skb->ip_summed = CHECKSUM_NONE; skb_setup_tx_timestamp(skb, sockc); if (flags & MSG_CONFIRM) skb_set_dst_pending_confirm(skb, 1); skb->transport_header = skb->network_header; err = memcpy_from_msg(iph, msg, length); if (err) { err = -EFAULT; kfree_skb(skb); goto error; } skb_dst_set(skb, &rt->dst); *dstp = NULL; /* if egress device is enslaved to an L3 master device pass the * skb to its handler for processing */ skb = l3mdev_ip6_out(sk, skb); if (unlikely(!skb)) return 0; /* Acquire rcu_read_lock() in case we need to use rt->rt6i_idev * in the error path. Since skb has been freed, the dst could * have been queued for deletion. */ rcu_read_lock(); IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTREQUESTS); err = NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT, net, sk, skb, NULL, rt->dst.dev, dst_output); if (err > 0) err = net_xmit_errno(err); if (err) { IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS); rcu_read_unlock(); goto error_check; } rcu_read_unlock(); out: return 0; error: IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS); error_check: if (err == -ENOBUFS && !inet6_test_bit(RECVERR6, sk)) err = 0; return err; } struct raw6_frag_vec { struct msghdr *msg; int hlen; char c[4]; }; static int rawv6_probe_proto_opt(struct raw6_frag_vec *rfv, struct flowi6 *fl6) { int err = 0; switch (fl6->flowi6_proto) { case IPPROTO_ICMPV6: rfv->hlen = 2; err = memcpy_from_msg(rfv->c, rfv->msg, rfv->hlen); if (!err) { fl6->fl6_icmp_type = rfv->c[0]; fl6->fl6_icmp_code = rfv->c[1]; } break; case IPPROTO_MH: rfv->hlen = 4; err = memcpy_from_msg(rfv->c, rfv->msg, rfv->hlen); if (!err) fl6->fl6_mh_type = rfv->c[2]; } return err; } static int raw6_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb) { struct raw6_frag_vec *rfv = from; if (offset < rfv->hlen) { int copy = min(rfv->hlen - offset, len); if (skb->ip_summed == CHECKSUM_PARTIAL) memcpy(to, rfv->c + offset, copy); else skb->csum = csum_block_add( skb->csum, csum_partial_copy_nocheck(rfv->c + offset, to, copy), odd); odd = 0; offset += copy; to += copy; len -= copy; if (!len) return 0; } offset -= rfv->hlen; return ip_generic_getfrag(rfv->msg, to, offset, len, odd, skb); } static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { struct ipv6_txoptions *opt_to_free = NULL; struct ipv6_txoptions opt_space; DECLARE_SOCKADDR(struct sockaddr_in6 *, sin6, msg->msg_name); struct in6_addr *daddr, *final_p, final; struct inet_sock *inet = inet_sk(sk); struct ipv6_pinfo *np = inet6_sk(sk); struct raw6_sock *rp = raw6_sk(sk); struct ipv6_txoptions *opt = NULL; struct ip6_flowlabel *flowlabel = NULL; struct dst_entry *dst = NULL; struct raw6_frag_vec rfv; struct flowi6 fl6; struct ipcm6_cookie ipc6; int addr_len = msg->msg_namelen; int hdrincl; u16 proto; int err; /* Rough check on arithmetic overflow, better check is made in ip6_append_data(). */ if (len > INT_MAX) return -EMSGSIZE; /* Mirror BSD error message compatibility */ if (msg->msg_flags & MSG_OOB) return -EOPNOTSUPP; hdrincl = inet_test_bit(HDRINCL, sk); ipcm6_init_sk(&ipc6, sk); /* * Get and verify the address. */ memset(&fl6, 0, sizeof(fl6)); fl6.flowi6_mark = ipc6.sockc.mark; fl6.flowi6_uid = sk_uid(sk); if (sin6) { if (addr_len < SIN6_LEN_RFC2133) return -EINVAL; if (sin6->sin6_family && sin6->sin6_family != AF_INET6) return -EAFNOSUPPORT; /* port is the proto value [0..255] carried in nexthdr */ proto = ntohs(sin6->sin6_port); if (!proto) proto = inet->inet_num; else if (proto != inet->inet_num && inet->inet_num != IPPROTO_RAW) return -EINVAL; if (proto > 255) return -EINVAL; daddr = &sin6->sin6_addr; if (inet6_test_bit(SNDFLOW, sk)) { fl6.flowlabel = sin6->sin6_flowinfo&IPV6_FLOWINFO_MASK; if (fl6.flowlabel&IPV6_FLOWLABEL_MASK) { flowlabel = fl6_sock_lookup(sk, fl6.flowlabel); if (IS_ERR(flowlabel)) return -EINVAL; } } /* * Otherwise it will be difficult to maintain * sk->sk_dst_cache. */ if (sk->sk_state == TCP_ESTABLISHED && ipv6_addr_equal(daddr, &sk->sk_v6_daddr)) daddr = &sk->sk_v6_daddr; if (addr_len >= sizeof(struct sockaddr_in6) && sin6->sin6_scope_id && __ipv6_addr_needs_scope_id(__ipv6_addr_type(daddr))) fl6.flowi6_oif = sin6->sin6_scope_id; } else { if (sk->sk_state != TCP_ESTABLISHED) return -EDESTADDRREQ; proto = inet->inet_num; daddr = &sk->sk_v6_daddr; fl6.flowlabel = np->flow_label; } if (fl6.flowi6_oif == 0) fl6.flowi6_oif = sk->sk_bound_dev_if; if (msg->msg_controllen) { opt = &opt_space; memset(opt, 0, sizeof(struct ipv6_txoptions)); opt->tot_len = sizeof(struct ipv6_txoptions); ipc6.opt = opt; err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, &ipc6); if (err < 0) { fl6_sock_release(flowlabel); return err; } if ((fl6.flowlabel&IPV6_FLOWLABEL_MASK) && !flowlabel) { flowlabel = fl6_sock_lookup(sk, fl6.flowlabel); if (IS_ERR(flowlabel)) return -EINVAL; } if (!(opt->opt_nflen|opt->opt_flen)) opt = NULL; } if (!opt) { opt = txopt_get(np); opt_to_free = opt; } if (flowlabel) opt = fl6_merge_options(&opt_space, flowlabel, opt); opt = ipv6_fixup_options(&opt_space, opt); fl6.flowi6_proto = proto; fl6.flowi6_mark = ipc6.sockc.mark; if (!hdrincl) { rfv.msg = msg; rfv.hlen = 0; err = rawv6_probe_proto_opt(&rfv, &fl6); if (err) goto out; } if (!ipv6_addr_any(daddr)) fl6.daddr = *daddr; else fl6.daddr.s6_addr[15] = 0x1; /* :: means loopback (BSD'ism) */ if (ipv6_addr_any(&fl6.saddr) && !ipv6_addr_any(&np->saddr)) fl6.saddr = np->saddr; final_p = fl6_update_dst(&fl6, opt, &final); if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr)) fl6.flowi6_oif = READ_ONCE(np->mcast_oif); else if (!fl6.flowi6_oif) fl6.flowi6_oif = READ_ONCE(np->ucast_oif); security_sk_classify_flow(sk, flowi6_to_flowi_common(&fl6)); if (hdrincl) fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH; fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel); dst = ip6_dst_lookup_flow(sock_net(sk), sk, &fl6, final_p); if (IS_ERR(dst)) { err = PTR_ERR(dst); goto out; } if (ipc6.hlimit < 0) ipc6.hlimit = ip6_sk_dst_hoplimit(np, &fl6, dst); if (msg->msg_flags&MSG_CONFIRM) goto do_confirm; back_from_confirm: if (hdrincl) err = rawv6_send_hdrinc(sk, msg, len, &fl6, &dst, msg->msg_flags, &ipc6.sockc); else { ipc6.opt = opt; lock_sock(sk); err = ip6_append_data(sk, raw6_getfrag, &rfv, len, 0, &ipc6, &fl6, dst_rt6_info(dst), msg->msg_flags); if (err) ip6_flush_pending_frames(sk); else if (!(msg->msg_flags & MSG_MORE)) err = rawv6_push_pending_frames(sk, &fl6, rp); release_sock(sk); } done: dst_release(dst); out: fl6_sock_release(flowlabel); txopt_put(opt_to_free); return err < 0 ? err : len; do_confirm: if (msg->msg_flags & MSG_PROBE) dst_confirm_neigh(dst, &fl6.daddr); if (!(msg->msg_flags & MSG_PROBE) || len) goto back_from_confirm; err = 0; goto done; } static int rawv6_seticmpfilter(struct sock *sk, int optname, sockptr_t optval, int optlen) { switch (optname) { case ICMPV6_FILTER: if (optlen > sizeof(struct icmp6_filter)) optlen = sizeof(struct icmp6_filter); if (copy_from_sockptr(&raw6_sk(sk)->filter, optval, optlen)) return -EFAULT; return 0; default: return -ENOPROTOOPT; } return 0; } static int rawv6_geticmpfilter(struct sock *sk, int optname, char __user *optval, int __user *optlen) { int len; switch (optname) { case ICMPV6_FILTER: if (get_user(len, optlen)) return -EFAULT; if (len < 0) return -EINVAL; if (len > sizeof(struct icmp6_filter)) len = sizeof(struct icmp6_filter); if (put_user(len, optlen)) return -EFAULT; if (copy_to_user(optval, &raw6_sk(sk)->filter, len)) return -EFAULT; return 0; default: return -ENOPROTOOPT; } return 0; } static int do_rawv6_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval, unsigned int optlen) { struct raw6_sock *rp = raw6_sk(sk); int val; if (optlen < sizeof(val)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; switch (optname) { case IPV6_HDRINCL: if (sk->sk_type != SOCK_RAW) return -EINVAL; inet_assign_bit(HDRINCL, sk, val); return 0; case IPV6_CHECKSUM: if (inet_sk(sk)->inet_num == IPPROTO_ICMPV6 && level == IPPROTO_IPV6) { /* * RFC3542 tells that IPV6_CHECKSUM socket * option in the IPPROTO_IPV6 level is not * allowed on ICMPv6 sockets. * If you want to set it, use IPPROTO_RAW * level IPV6_CHECKSUM socket option * (Linux extension). */ return -EINVAL; } /* You may get strange result with a positive odd offset; RFC2292bis agrees with me. */ if (val > 0 && (val&1)) return -EINVAL; if (val < 0) { rp->checksum = 0; } else { rp->checksum = 1; rp->offset = val; } return 0; default: return -ENOPROTOOPT; } } static int rawv6_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval, unsigned int optlen) { switch (level) { case SOL_RAW: break; case SOL_ICMPV6: if (inet_sk(sk)->inet_num != IPPROTO_ICMPV6) return -EOPNOTSUPP; return rawv6_seticmpfilter(sk, optname, optval, optlen); case SOL_IPV6: if (optname == IPV6_CHECKSUM || optname == IPV6_HDRINCL) break; fallthrough; default: return ipv6_setsockopt(sk, level, optname, optval, optlen); } return do_rawv6_setsockopt(sk, level, optname, optval, optlen); } static int do_rawv6_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen) { struct raw6_sock *rp = raw6_sk(sk); int val, len; if (get_user(len, optlen)) return -EFAULT; switch (optname) { case IPV6_HDRINCL: val = inet_test_bit(HDRINCL, sk); break; case IPV6_CHECKSUM: /* * We allow getsockopt() for IPPROTO_IPV6-level * IPV6_CHECKSUM socket option on ICMPv6 sockets * since RFC3542 is silent about it. */ if (rp->checksum == 0) val = -1; else val = rp->offset; break; default: return -ENOPROTOOPT; } len = min_t(unsigned int, sizeof(int), len); if (put_user(len, optlen)) return -EFAULT; if (copy_to_user(optval, &val, len)) return -EFAULT; return 0; } static int rawv6_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen) { switch (level) { case SOL_RAW: break; case SOL_ICMPV6: if (inet_sk(sk)->inet_num != IPPROTO_ICMPV6) return -EOPNOTSUPP; return rawv6_geticmpfilter(sk, optname, optval, optlen); case SOL_IPV6: if (optname == IPV6_CHECKSUM || optname == IPV6_HDRINCL) break; fallthrough; default: return ipv6_getsockopt(sk, level, optname, optval, optlen); } return do_rawv6_getsockopt(sk, level, optname, optval, optlen); } static int rawv6_ioctl(struct sock *sk, int cmd, int *karg) { switch (cmd) { case SIOCOUTQ: { *karg = sk_wmem_alloc_get(sk); return 0; } case SIOCINQ: { struct sk_buff *skb; spin_lock_bh(&sk->sk_receive_queue.lock); skb = skb_peek(&sk->sk_receive_queue); if (skb) *karg = skb->len; else *karg = 0; spin_unlock_bh(&sk->sk_receive_queue.lock); return 0; } default: #ifdef CONFIG_IPV6_MROUTE return ip6mr_ioctl(sk, cmd, karg); #else return -ENOIOCTLCMD; #endif } } #ifdef CONFIG_COMPAT static int compat_rawv6_ioctl(struct sock *sk, unsigned int cmd, unsigned long arg) { switch (cmd) { case SIOCOUTQ: case SIOCINQ: return -ENOIOCTLCMD; default: #ifdef CONFIG_IPV6_MROUTE return ip6mr_compat_ioctl(sk, cmd, compat_ptr(arg)); #else return -ENOIOCTLCMD; #endif } } #endif static void rawv6_close(struct sock *sk, long timeout) { if (inet_sk(sk)->inet_num == IPPROTO_RAW) ip6_ra_control(sk, -1); ip6mr_sk_done(sk); sk_common_release(sk); } static void raw6_destroy(struct sock *sk) { lock_sock(sk); ip6_flush_pending_frames(sk); release_sock(sk); } static int rawv6_init_sk(struct sock *sk) { struct raw6_sock *rp = raw6_sk(sk); sk->sk_drop_counters = &rp->drop_counters; switch (inet_sk(sk)->inet_num) { case IPPROTO_ICMPV6: rp->checksum = 1; rp->offset = 2; break; case IPPROTO_MH: rp->checksum = 1; rp->offset = 4; break; default: break; } return 0; } struct proto rawv6_prot = { .name = "RAWv6", .owner = THIS_MODULE, .close = rawv6_close, .destroy = raw6_destroy, .connect = ip6_datagram_connect_v6_only, .disconnect = __udp_disconnect, .ioctl = rawv6_ioctl, .init = rawv6_init_sk, .setsockopt = rawv6_setsockopt, .getsockopt = rawv6_getsockopt, .sendmsg = rawv6_sendmsg, .recvmsg = rawv6_recvmsg, .bind = rawv6_bind, .backlog_rcv = rawv6_rcv_skb, .hash = raw_hash_sk, .unhash = raw_unhash_sk, .obj_size = sizeof(struct raw6_sock), .ipv6_pinfo_offset = offsetof(struct raw6_sock, inet6), .useroffset = offsetof(struct raw6_sock, filter), .usersize = sizeof_field(struct raw6_sock, filter), .h.raw_hash = &raw_v6_hashinfo, #ifdef CONFIG_COMPAT .compat_ioctl = compat_rawv6_ioctl, #endif .diag_destroy = raw_abort, }; #ifdef CONFIG_PROC_FS static int raw6_seq_show(struct seq_file *seq, void *v) { if (v == SEQ_START_TOKEN) { seq_puts(seq, IPV6_SEQ_DGRAM_HEADER); } else { struct sock *sp = v; __u16 srcp = inet_sk(sp)->inet_num; ip6_dgram_sock_seq_show(seq, v, srcp, 0, raw_seq_private(seq)->bucket); } return 0; } static const struct seq_operations raw6_seq_ops = { .start = raw_seq_start, .next = raw_seq_next, .stop = raw_seq_stop, .show = raw6_seq_show, }; static int __net_init raw6_init_net(struct net *net) { if (!proc_create_net_data("raw6", 0444, net->proc_net, &raw6_seq_ops, sizeof(struct raw_iter_state), &raw_v6_hashinfo)) return -ENOMEM; return 0; } static void __net_exit raw6_exit_net(struct net *net) { remove_proc_entry("raw6", net->proc_net); } static struct pernet_operations raw6_net_ops = { .init = raw6_init_net, .exit = raw6_exit_net, }; int __init raw6_proc_init(void) { return register_pernet_subsys(&raw6_net_ops); } void raw6_proc_exit(void) { unregister_pernet_subsys(&raw6_net_ops); } #endif /* CONFIG_PROC_FS */ /* Same as inet6_dgram_ops, sans udp_poll. */ const struct proto_ops inet6_sockraw_ops = { .family = PF_INET6, .owner = THIS_MODULE, .release = inet6_release, .bind = inet6_bind, .connect = inet_dgram_connect, /* ok */ .socketpair = sock_no_socketpair, /* a do nothing */ .accept = sock_no_accept, /* a do nothing */ .getname = inet6_getname, .poll = datagram_poll, /* ok */ .ioctl = inet6_ioctl, /* must change */ .gettstamp = sock_gettstamp, .listen = sock_no_listen, /* ok */ .shutdown = inet_shutdown, /* ok */ .setsockopt = sock_common_setsockopt, /* ok */ .getsockopt = sock_common_getsockopt, /* ok */ .sendmsg = inet_sendmsg, /* ok */ .recvmsg = sock_common_recvmsg, /* ok */ .mmap = sock_no_mmap, #ifdef CONFIG_COMPAT .compat_ioctl = inet6_compat_ioctl, #endif }; static struct inet_protosw rawv6_protosw = { .type = SOCK_RAW, .protocol = IPPROTO_IP, /* wild card */ .prot = &rawv6_prot, .ops = &inet6_sockraw_ops, .flags = INET_PROTOSW_REUSE, }; int __init rawv6_init(void) { return inet6_register_protosw(&rawv6_protosw); } void rawv6_exit(void) { inet6_unregister_protosw(&rawv6_protosw); } |
| 6062 6029 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_COMPAT_H #define _ASM_X86_COMPAT_H /* * Architecture specific compatibility types */ #include <linux/types.h> #include <linux/sched.h> #include <linux/sched/task_stack.h> #include <asm/processor.h> #include <asm/user32.h> #include <asm/unistd.h> #define compat_mode_t compat_mode_t typedef u16 compat_mode_t; #define __compat_uid_t __compat_uid_t typedef u16 __compat_uid_t; typedef u16 __compat_gid_t; #define compat_dev_t compat_dev_t typedef u16 compat_dev_t; #define compat_ipc_pid_t compat_ipc_pid_t typedef u16 compat_ipc_pid_t; #define compat_statfs compat_statfs #include <asm-generic/compat.h> #define COMPAT_UTS_MACHINE "i686\0\0" typedef u16 compat_nlink_t; struct compat_stat { u32 st_dev; compat_ino_t st_ino; compat_mode_t st_mode; compat_nlink_t st_nlink; __compat_uid_t st_uid; __compat_gid_t st_gid; u32 st_rdev; u32 st_size; u32 st_blksize; u32 st_blocks; u32 st_atime; u32 st_atime_nsec; u32 st_mtime; u32 st_mtime_nsec; u32 st_ctime; u32 st_ctime_nsec; u32 __unused4; u32 __unused5; }; /* * IA32 uses 4 byte alignment for 64 bit quantities, so we need to pack the * compat flock64 structure. */ #define __ARCH_NEED_COMPAT_FLOCK64_PACKED struct compat_statfs { int f_type; int f_bsize; int f_blocks; int f_bfree; int f_bavail; int f_files; int f_ffree; compat_fsid_t f_fsid; int f_namelen; /* SunOS ignores this field. */ int f_frsize; int f_flags; int f_spare[4]; }; #ifdef CONFIG_X86_X32_ABI #define COMPAT_USE_64BIT_TIME \ (!!(task_pt_regs(current)->orig_ax & __X32_SYSCALL_BIT)) #endif static inline bool in_x32_syscall(void) { #ifdef CONFIG_X86_X32_ABI if (task_pt_regs(current)->orig_ax & __X32_SYSCALL_BIT) return true; #endif return false; } static inline bool in_32bit_syscall(void) { return in_ia32_syscall() || in_x32_syscall(); } #ifdef CONFIG_COMPAT static inline bool in_compat_syscall(void) { return in_32bit_syscall(); } #define in_compat_syscall in_compat_syscall /* override the generic impl */ #define compat_need_64bit_alignment_fixup in_ia32_syscall #endif struct compat_siginfo; #ifdef CONFIG_X86_X32_ABI int copy_siginfo_to_user32(struct compat_siginfo __user *to, const kernel_siginfo_t *from); #define copy_siginfo_to_user32 copy_siginfo_to_user32 #endif /* CONFIG_X86_X32_ABI */ #endif /* _ASM_X86_COMPAT_H */ |
| 350 38 38 223 214 208 10 233 231 214 214 213 1 233 233 5 78 78 15 15 5 1 202 202 202 201 202 196 316 354 18 21 18 18 10 10 306 308 308 308 308 308 308 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NET_IP6_ROUTE_H #define _NET_IP6_ROUTE_H #include <net/addrconf.h> #include <net/flow.h> #include <net/ip6_fib.h> #include <net/sock.h> #include <net/lwtunnel.h> #include <linux/ip.h> #include <linux/ipv6.h> #include <linux/route.h> #include <net/nexthop.h> struct route_info { __u8 type; __u8 length; __u8 prefix_len; #if defined(__BIG_ENDIAN_BITFIELD) __u8 reserved_h:3, route_pref:2, reserved_l:3; #elif defined(__LITTLE_ENDIAN_BITFIELD) __u8 reserved_l:3, route_pref:2, reserved_h:3; #endif __be32 lifetime; __u8 prefix[]; /* 0,8 or 16 */ }; #define RT6_LOOKUP_F_IFACE 0x00000001 #define RT6_LOOKUP_F_REACHABLE 0x00000002 #define RT6_LOOKUP_F_HAS_SADDR 0x00000004 #define RT6_LOOKUP_F_SRCPREF_TMP 0x00000008 #define RT6_LOOKUP_F_SRCPREF_PUBLIC 0x00000010 #define RT6_LOOKUP_F_SRCPREF_COA 0x00000020 #define RT6_LOOKUP_F_IGNORE_LINKSTATE 0x00000040 #define RT6_LOOKUP_F_DST_NOREF 0x00000080 /* We do not (yet ?) support IPv6 jumbograms (RFC 2675) * Unlike IPv4, hdr->seg_len doesn't include the IPv6 header */ #define IP6_MAX_MTU (0xFFFF + sizeof(struct ipv6hdr)) /* * rt6_srcprefs2flags() and rt6_flags2srcprefs() translate * between IPV6_ADDR_PREFERENCES socket option values * IPV6_PREFER_SRC_TMP = 0x1 * IPV6_PREFER_SRC_PUBLIC = 0x2 * IPV6_PREFER_SRC_COA = 0x4 * and above RT6_LOOKUP_F_SRCPREF_xxx flags. */ static inline int rt6_srcprefs2flags(unsigned int srcprefs) { return (srcprefs & IPV6_PREFER_SRC_MASK) << 3; } static inline unsigned int rt6_flags2srcprefs(int flags) { return (flags >> 3) & IPV6_PREFER_SRC_MASK; } static inline bool rt6_need_strict(const struct in6_addr *daddr) { return ipv6_addr_type(daddr) & (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | IPV6_ADDR_LOOPBACK); } /* fib entries using a nexthop object can not be coalesced into * a multipath route */ static inline bool rt6_qualify_for_ecmp(const struct fib6_info *f6i) { /* the RTF_ADDRCONF flag filters out RA's */ return !(f6i->fib6_flags & RTF_ADDRCONF) && !f6i->nh && f6i->fib6_nh->fib_nh_gw_family; } void ip6_route_input(struct sk_buff *skb); struct dst_entry *ip6_route_input_lookup(struct net *net, struct net_device *dev, struct flowi6 *fl6, const struct sk_buff *skb, int flags); struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk, struct flowi6 *fl6, int flags); static inline struct dst_entry *ip6_route_output(struct net *net, const struct sock *sk, struct flowi6 *fl6) { return ip6_route_output_flags(net, sk, fl6, 0); } /* Only conditionally release dst if flags indicates * !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list. */ static inline void ip6_rt_put_flags(struct rt6_info *rt, int flags) { if (!(flags & RT6_LOOKUP_F_DST_NOREF) || !list_empty(&rt->dst.rt_uncached)) ip6_rt_put(rt); } struct dst_entry *ip6_route_lookup(struct net *net, struct flowi6 *fl6, const struct sk_buff *skb, int flags); struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, int ifindex, struct flowi6 *fl6, const struct sk_buff *skb, int flags); void ip6_route_init_special_entries(void); int ip6_route_init(void); void ip6_route_cleanup(void); int ipv6_route_ioctl(struct net *net, unsigned int cmd, struct in6_rtmsg *rtmsg); int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags, struct netlink_ext_ack *extack); int ip6_ins_rt(struct net *net, struct fib6_info *f6i); int ip6_del_rt(struct net *net, struct fib6_info *f6i, bool skip_notify); void rt6_flush_exceptions(struct fib6_info *f6i); void rt6_age_exceptions(struct fib6_info *f6i, struct fib6_gc_args *gc_args, unsigned long now); static inline int ip6_route_get_saddr(struct net *net, struct fib6_info *f6i, const struct in6_addr *daddr, unsigned int prefs, int l3mdev_index, struct in6_addr *saddr) { struct net_device *l3mdev; struct net_device *dev; bool same_vrf; int err = 0; rcu_read_lock(); l3mdev = dev_get_by_index_rcu(net, l3mdev_index); if (!f6i || !f6i->fib6_prefsrc.plen || l3mdev) dev = f6i ? fib6_info_nh_dev(f6i) : NULL; same_vrf = !l3mdev || l3mdev_master_dev_rcu(dev) == l3mdev; if (f6i && f6i->fib6_prefsrc.plen && same_vrf) *saddr = f6i->fib6_prefsrc.addr; else err = ipv6_dev_get_saddr(net, same_vrf ? dev : l3mdev, daddr, prefs, saddr); rcu_read_unlock(); return err; } struct rt6_info *rt6_lookup(struct net *net, const struct in6_addr *daddr, const struct in6_addr *saddr, int oif, const struct sk_buff *skb, int flags); u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6, const struct sk_buff *skb, struct flow_keys *hkeys); struct dst_entry *icmp6_dst_alloc(struct net_device *dev, struct flowi6 *fl6); void fib6_force_start_gc(struct net *net); struct fib6_info *addrconf_f6i_alloc(struct net *net, struct inet6_dev *idev, const struct in6_addr *addr, bool anycast, gfp_t gfp_flags, struct netlink_ext_ack *extack); struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev, int flags); /* * support functions for ND * */ struct fib6_info *rt6_get_dflt_router(struct net *net, const struct in6_addr *addr, struct net_device *dev); struct fib6_info *rt6_add_dflt_router(struct net *net, const struct in6_addr *gwaddr, struct net_device *dev, unsigned int pref, u32 defrtr_usr_metric, int lifetime); void rt6_purge_dflt_routers(struct net *net); int rt6_route_rcv(struct net_device *dev, u8 *opt, int len, const struct in6_addr *gwaddr); void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, int oif, u32 mark, kuid_t uid); void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu); void ip6_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark, kuid_t uid); void ip6_redirect_no_header(struct sk_buff *skb, struct net *net, int oif); void ip6_sk_redirect(struct sk_buff *skb, struct sock *sk); struct netlink_callback; struct rt6_rtnl_dump_arg { struct sk_buff *skb; struct netlink_callback *cb; struct net *net; struct fib_dump_filter filter; }; int rt6_dump_route(struct fib6_info *f6i, void *p_arg, unsigned int skip); void rt6_mtu_change(struct net_device *dev, unsigned int mtu); void rt6_remove_prefsrc(struct inet6_ifaddr *ifp); void rt6_clean_tohost(struct net *net, struct in6_addr *gateway); void rt6_sync_up(struct net_device *dev, unsigned char nh_flags); void rt6_disable_ip(struct net_device *dev, unsigned long event); void rt6_sync_down_dev(struct net_device *dev, unsigned long event); void rt6_multipath_rebalance(struct fib6_info *f6i); void rt6_uncached_list_add(struct rt6_info *rt); void rt6_uncached_list_del(struct rt6_info *rt); static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb) { const struct dst_entry *dst = skb_dst(skb); if (dst) return dst_rt6_info(dst); return NULL; } /* * Store a destination cache entry in a socket */ static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst, bool daddr_set, bool saddr_set) { struct ipv6_pinfo *np = inet6_sk(sk); np->dst_cookie = rt6_get_cookie(dst_rt6_info(dst)); sk_setup_caps(sk, dst); np->daddr_cache = daddr_set; #ifdef CONFIG_IPV6_SUBTREES np->saddr_cache = saddr_set; #endif } void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst, const struct flowi6 *fl6); static inline bool ipv6_unicast_destination(const struct sk_buff *skb) { const struct rt6_info *rt = dst_rt6_info(skb_dst(skb)); return rt->rt6i_flags & RTF_LOCAL; } static inline bool ipv6_anycast_destination(const struct dst_entry *dst, const struct in6_addr *daddr) { const struct rt6_info *rt = dst_rt6_info(dst); return rt->rt6i_flags & RTF_ANYCAST || (rt->rt6i_dst.plen < 127 && !(rt->rt6i_flags & (RTF_GATEWAY | RTF_NONEXTHOP)) && ipv6_addr_equal(&rt->rt6i_dst.addr, daddr)); } int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, int (*output)(struct net *, struct sock *, struct sk_buff *)); static inline unsigned int ip6_skb_dst_mtu(const struct sk_buff *skb) { const struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ? inet6_sk(skb->sk) : NULL; const struct dst_entry *dst = skb_dst(skb); unsigned int mtu; if (np && READ_ONCE(np->pmtudisc) >= IPV6_PMTUDISC_PROBE) { mtu = READ_ONCE(dst_dev(dst)->mtu); mtu -= lwtunnel_headroom(dst->lwtstate, mtu); } else { mtu = dst_mtu(dst); } return mtu; } static inline bool ip6_sk_accept_pmtu(const struct sock *sk) { u8 pmtudisc = READ_ONCE(inet6_sk(sk)->pmtudisc); return pmtudisc != IPV6_PMTUDISC_INTERFACE && pmtudisc != IPV6_PMTUDISC_OMIT; } static inline bool ip6_sk_ignore_df(const struct sock *sk) { u8 pmtudisc = READ_ONCE(inet6_sk(sk)->pmtudisc); return pmtudisc < IPV6_PMTUDISC_DO || pmtudisc == IPV6_PMTUDISC_OMIT; } static inline const struct in6_addr *rt6_nexthop(const struct rt6_info *rt, const struct in6_addr *daddr) { if (rt->rt6i_flags & RTF_GATEWAY) return &rt->rt6i_gateway; else if (unlikely(rt->rt6i_flags & RTF_CACHE)) return &rt->rt6i_dst.addr; else return daddr; } static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *b) { struct fib6_nh *nha, *nhb; if (a->nh || b->nh) return nexthop_cmp(a->nh, b->nh); nha = a->fib6_nh; nhb = b->fib6_nh; return nha->fib_nh_dev == nhb->fib_nh_dev && ipv6_addr_equal(&nha->fib_nh_gw6, &nhb->fib_nh_gw6) && !lwtunnel_cmp_encap(nha->fib_nh_lws, nhb->fib_nh_lws); } static inline unsigned int ip6_dst_mtu_maybe_forward(const struct dst_entry *dst, bool forwarding) { struct inet6_dev *idev; unsigned int mtu; if (!forwarding || dst_metric_locked(dst, RTAX_MTU)) { mtu = dst_metric_raw(dst, RTAX_MTU); if (mtu) goto out; } mtu = IPV6_MIN_MTU; rcu_read_lock(); idev = __in6_dev_get(dst_dev_rcu(dst)); if (idev) mtu = READ_ONCE(idev->cnf.mtu6); rcu_read_unlock(); out: return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } u32 ip6_mtu_from_fib6(const struct fib6_result *res, const struct in6_addr *daddr, const struct in6_addr *saddr); struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw, struct net_device *dev, struct sk_buff *skb, const void *daddr); #endif |
| 5 5 5 3 3 5 2 2 2 5 2 5 41 16 14 14 13 4 48 49 47 49 40 41 16 1 1 1 1 2 2 1 2 2 2 2 1 1 2 2 1 1 2 1 2 2 2 1 1 1 3 3 3 1 1 2 2 2 2 2 2 5 2 7 8 7 6 5 5 5 2 5 8 21 4 21 3 2 2 17 8 9 9 9 8 6 2 1 1 19 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Ioctl handler * Linux ethernet bridge * * Authors: * Lennert Buytenhek <buytenh@gnu.org> */ #include <linux/capability.h> #include <linux/compat.h> #include <linux/kernel.h> #include <linux/if_bridge.h> #include <linux/netdevice.h> #include <linux/slab.h> #include <linux/times.h> #include <net/net_namespace.h> #include <linux/uaccess.h> #include "br_private.h" static int get_bridge_ifindices(struct net *net, int *indices, int num) { struct net_device *dev; int i = 0; rcu_read_lock(); for_each_netdev_rcu(net, dev) { if (i >= num) break; if (netif_is_bridge_master(dev)) indices[i++] = dev->ifindex; } rcu_read_unlock(); return i; } /* called with RTNL */ static void get_port_ifindices(struct net_bridge *br, int *ifindices, int num) { struct net_bridge_port *p; list_for_each_entry(p, &br->port_list, list) { if (p->port_no < num) ifindices[p->port_no] = p->dev->ifindex; } } /* * Format up to a page worth of forwarding table entries * userbuf -- where to copy result * maxnum -- maximum number of entries desired * (limited to a page for sanity) * offset -- number of records to skip */ static int get_fdb_entries(struct net_bridge *br, void __user *userbuf, unsigned long maxnum, unsigned long offset) { int num; void *buf; size_t size; /* Clamp size to PAGE_SIZE, test maxnum to avoid overflow */ if (maxnum > PAGE_SIZE/sizeof(struct __fdb_entry)) maxnum = PAGE_SIZE/sizeof(struct __fdb_entry); size = maxnum * sizeof(struct __fdb_entry); buf = kmalloc(size, GFP_USER); if (!buf) return -ENOMEM; num = br_fdb_fillbuf(br, buf, maxnum, offset); if (num > 0) { if (copy_to_user(userbuf, buf, array_size(num, sizeof(struct __fdb_entry)))) num = -EFAULT; } kfree(buf); return num; } /* called with RTNL */ static int add_del_if(struct net_bridge *br, int ifindex, int isadd) { struct net *net = dev_net(br->dev); struct net_device *dev; int ret; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; dev = __dev_get_by_index(net, ifindex); if (dev == NULL) return -EINVAL; if (isadd) ret = br_add_if(br, dev, NULL); else ret = br_del_if(br, dev); return ret; } #define BR_UARGS_MAX 4 static int br_dev_read_uargs(unsigned long *args, size_t nr_args, void __user **argp, void __user *data) { int ret; if (nr_args < 2 || nr_args > BR_UARGS_MAX) return -EINVAL; if (in_compat_syscall()) { unsigned int cargs[BR_UARGS_MAX]; int i; ret = copy_from_user(cargs, data, nr_args * sizeof(*cargs)); if (ret) goto fault; for (i = 0; i < nr_args; ++i) args[i] = cargs[i]; *argp = compat_ptr(args[1]); } else { ret = copy_from_user(args, data, nr_args * sizeof(*args)); if (ret) goto fault; *argp = (void __user *)args[1]; } return 0; fault: return -EFAULT; } /* * Legacy ioctl's through SIOCDEVPRIVATE * This interface is deprecated because it was too difficult * to do the translation for 32/64bit ioctl compatibility. */ int br_dev_siocdevprivate(struct net_device *dev, struct ifreq *rq, void __user *data, int cmd) { struct net_bridge *br = netdev_priv(dev); struct net_bridge_port *p = NULL; unsigned long args[4]; void __user *argp; int ret; ret = br_dev_read_uargs(args, ARRAY_SIZE(args), &argp, data); if (ret) return ret; switch (args[0]) { case BRCTL_ADD_IF: case BRCTL_DEL_IF: return add_del_if(br, args[1], args[0] == BRCTL_ADD_IF); case BRCTL_GET_BRIDGE_INFO: { struct __bridge_info b; memset(&b, 0, sizeof(struct __bridge_info)); rcu_read_lock(); memcpy(&b.designated_root, &br->designated_root, 8); memcpy(&b.bridge_id, &br->bridge_id, 8); b.root_path_cost = br->root_path_cost; b.max_age = jiffies_to_clock_t(br->max_age); b.hello_time = jiffies_to_clock_t(br->hello_time); b.forward_delay = br->forward_delay; b.bridge_max_age = br->bridge_max_age; b.bridge_hello_time = br->bridge_hello_time; b.bridge_forward_delay = jiffies_to_clock_t(br->bridge_forward_delay); b.topology_change = br->topology_change; b.topology_change_detected = br->topology_change_detected; b.root_port = br->root_port; b.stp_enabled = (br->stp_enabled != BR_NO_STP); b.ageing_time = jiffies_to_clock_t(br->ageing_time); b.hello_timer_value = br_timer_value(&br->hello_timer); b.tcn_timer_value = br_timer_value(&br->tcn_timer); b.topology_change_timer_value = br_timer_value(&br->topology_change_timer); b.gc_timer_value = br_timer_value(&br->gc_work.timer); rcu_read_unlock(); if (copy_to_user((void __user *)args[1], &b, sizeof(b))) return -EFAULT; return 0; } case BRCTL_GET_PORT_LIST: { int num, *indices; num = args[2]; if (num < 0) return -EINVAL; if (num == 0) num = 256; if (num > BR_MAX_PORTS) num = BR_MAX_PORTS; indices = kcalloc(num, sizeof(int), GFP_KERNEL); if (indices == NULL) return -ENOMEM; get_port_ifindices(br, indices, num); if (copy_to_user(argp, indices, array_size(num, sizeof(int)))) num = -EFAULT; kfree(indices); return num; } case BRCTL_SET_BRIDGE_FORWARD_DELAY: if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; ret = br_set_forward_delay(br, args[1]); break; case BRCTL_SET_BRIDGE_HELLO_TIME: if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; ret = br_set_hello_time(br, args[1]); break; case BRCTL_SET_BRIDGE_MAX_AGE: if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; ret = br_set_max_age(br, args[1]); break; case BRCTL_SET_AGEING_TIME: if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; ret = br_set_ageing_time(br, args[1]); break; case BRCTL_GET_PORT_INFO: { struct __port_info p; struct net_bridge_port *pt; rcu_read_lock(); if ((pt = br_get_port(br, args[2])) == NULL) { rcu_read_unlock(); return -EINVAL; } memset(&p, 0, sizeof(struct __port_info)); memcpy(&p.designated_root, &pt->designated_root, 8); memcpy(&p.designated_bridge, &pt->designated_bridge, 8); p.port_id = pt->port_id; p.designated_port = pt->designated_port; p.path_cost = pt->path_cost; p.designated_cost = pt->designated_cost; p.state = pt->state; p.top_change_ack = pt->topology_change_ack; p.config_pending = pt->config_pending; p.message_age_timer_value = br_timer_value(&pt->message_age_timer); p.forward_delay_timer_value = br_timer_value(&pt->forward_delay_timer); p.hold_timer_value = br_timer_value(&pt->hold_timer); rcu_read_unlock(); if (copy_to_user(argp, &p, sizeof(p))) return -EFAULT; return 0; } case BRCTL_SET_BRIDGE_STP_STATE: if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; ret = br_stp_set_enabled(br, args[1], NULL); break; case BRCTL_SET_BRIDGE_PRIORITY: if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; br_stp_set_bridge_priority(br, args[1]); ret = 0; break; case BRCTL_SET_PORT_PRIORITY: { if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); if ((p = br_get_port(br, args[1])) == NULL) ret = -EINVAL; else ret = br_stp_set_port_priority(p, args[2]); spin_unlock_bh(&br->lock); break; } case BRCTL_SET_PATH_COST: { if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN)) return -EPERM; spin_lock_bh(&br->lock); if ((p = br_get_port(br, args[1])) == NULL) ret = -EINVAL; else ret = br_stp_set_path_cost(p, args[2]); spin_unlock_bh(&br->lock); break; } case BRCTL_GET_FDB_ENTRIES: return get_fdb_entries(br, argp, args[2], args[3]); default: ret = -EOPNOTSUPP; } if (!ret) { if (p) br_ifinfo_notify(RTM_NEWLINK, NULL, p); else netdev_state_change(br->dev); } return ret; } static int old_deviceless(struct net *net, void __user *data) { unsigned long args[3]; void __user *argp; int ret; ret = br_dev_read_uargs(args, ARRAY_SIZE(args), &argp, data); if (ret) return ret; switch (args[0]) { case BRCTL_GET_VERSION: return BRCTL_VERSION; case BRCTL_GET_BRIDGES: { int *indices; int ret = 0; if (args[2] >= 2048) return -ENOMEM; indices = kcalloc(args[2], sizeof(int), GFP_KERNEL); if (indices == NULL) return -ENOMEM; args[2] = get_bridge_ifindices(net, indices, args[2]); ret = copy_to_user(argp, indices, array_size(args[2], sizeof(int))) ? -EFAULT : args[2]; kfree(indices); return ret; } case BRCTL_ADD_BRIDGE: case BRCTL_DEL_BRIDGE: { char buf[IFNAMSIZ]; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(buf, argp, IFNAMSIZ)) return -EFAULT; buf[IFNAMSIZ-1] = 0; if (args[0] == BRCTL_ADD_BRIDGE) return br_add_bridge(net, buf); return br_del_bridge(net, buf); } } return -EOPNOTSUPP; } int br_ioctl_stub(struct net *net, unsigned int cmd, void __user *uarg) { int ret = -EOPNOTSUPP; struct ifreq ifr; if (cmd == SIOCBRADDIF || cmd == SIOCBRDELIF) { void __user *data; char *colon; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; if (get_user_ifreq(&ifr, &data, uarg)) return -EFAULT; ifr.ifr_name[IFNAMSIZ - 1] = 0; colon = strchr(ifr.ifr_name, ':'); if (colon) *colon = 0; } rtnl_lock(); switch (cmd) { case SIOCGIFBR: case SIOCSIFBR: ret = old_deviceless(net, uarg); break; case SIOCBRADDBR: case SIOCBRDELBR: { char buf[IFNAMSIZ]; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) { ret = -EPERM; break; } if (copy_from_user(buf, uarg, IFNAMSIZ)) { ret = -EFAULT; break; } buf[IFNAMSIZ-1] = 0; if (cmd == SIOCBRADDBR) ret = br_add_bridge(net, buf); else ret = br_del_bridge(net, buf); } break; case SIOCBRADDIF: case SIOCBRDELIF: { struct net_device *dev; dev = __dev_get_by_name(net, ifr.ifr_name); if (!dev || !netif_device_present(dev)) { ret = -ENODEV; break; } if (!netif_is_bridge_master(dev)) { ret = -EOPNOTSUPP; break; } ret = add_del_if(netdev_priv(dev), ifr.ifr_ifindex, cmd == SIOCBRADDIF); } break; } rtnl_unlock(); return ret; } |
| 55 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_BLOCKGROUP_LOCK_H #define _LINUX_BLOCKGROUP_LOCK_H /* * Per-blockgroup locking for ext2 and ext3. * * Simple hashed spinlocking. */ #include <linux/spinlock.h> #include <linux/cache.h> #ifdef CONFIG_SMP #define NR_BG_LOCKS (4 << ilog2(NR_CPUS < 32 ? NR_CPUS : 32)) #else #define NR_BG_LOCKS 1 #endif struct bgl_lock { spinlock_t lock; } ____cacheline_aligned_in_smp; struct blockgroup_lock { struct bgl_lock locks[NR_BG_LOCKS]; }; static inline void bgl_lock_init(struct blockgroup_lock *bgl) { int i; for (i = 0; i < NR_BG_LOCKS; i++) spin_lock_init(&bgl->locks[i].lock); } static inline spinlock_t * bgl_lock_ptr(struct blockgroup_lock *bgl, unsigned int block_group) { return &bgl->locks[block_group & (NR_BG_LOCKS-1)].lock; } #endif |
| 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 13 13 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 14 15 13 13 13 13 13 13 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 | // SPDX-License-Identifier: GPL-2.0-only /* * linux/net/sunrpc/sched.c * * Scheduling for synchronous and asynchronous RPC requests. * * Copyright (C) 1996 Olaf Kirch, <okir@monad.swb.de> * * TCP NFS related read + write fixes * (C) 1999 Dave Airlie, University of Limerick, Ireland <airlied@linux.ie> */ #include <linux/module.h> #include <linux/sched.h> #include <linux/interrupt.h> #include <linux/slab.h> #include <linux/mempool.h> #include <linux/smp.h> #include <linux/spinlock.h> #include <linux/mutex.h> #include <linux/freezer.h> #include <linux/sched/mm.h> #include <linux/sunrpc/clnt.h> #include <linux/sunrpc/metrics.h> #include "sunrpc.h" #define CREATE_TRACE_POINTS #include <trace/events/sunrpc.h> /* * RPC slabs and memory pools */ #define RPC_BUFFER_MAXSIZE (2048) #define RPC_BUFFER_POOLSIZE (8) #define RPC_TASK_POOLSIZE (8) static struct kmem_cache *rpc_task_slabp __read_mostly; static struct kmem_cache *rpc_buffer_slabp __read_mostly; static mempool_t *rpc_task_mempool __read_mostly; static mempool_t *rpc_buffer_mempool __read_mostly; static void rpc_async_schedule(struct work_struct *); static void rpc_release_task(struct rpc_task *task); static void __rpc_queue_timer_fn(struct work_struct *); /* * RPC tasks sit here while waiting for conditions to improve. */ static struct rpc_wait_queue delay_queue; /* * rpciod-related stuff */ struct workqueue_struct *rpciod_workqueue __read_mostly; struct workqueue_struct *xprtiod_workqueue __read_mostly; EXPORT_SYMBOL_GPL(xprtiod_workqueue); gfp_t rpc_task_gfp_mask(void) { if (current->flags & PF_WQ_WORKER) return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; return GFP_KERNEL; } EXPORT_SYMBOL_GPL(rpc_task_gfp_mask); bool rpc_task_set_rpc_status(struct rpc_task *task, int rpc_status) { if (cmpxchg(&task->tk_rpc_status, 0, rpc_status) == 0) return true; return false; } unsigned long rpc_task_timeout(const struct rpc_task *task) { unsigned long timeout = READ_ONCE(task->tk_timeout); if (timeout != 0) { unsigned long now = jiffies; if (time_before(now, timeout)) return timeout - now; } return 0; } EXPORT_SYMBOL_GPL(rpc_task_timeout); /* * Disable the timer for a given RPC task. Should be called with * queue->lock and bh_disabled in order to avoid races within * rpc_run_timer(). */ static void __rpc_disable_timer(struct rpc_wait_queue *queue, struct rpc_task *task) { if (list_empty(&task->u.tk_wait.timer_list)) return; task->tk_timeout = 0; list_del(&task->u.tk_wait.timer_list); if (list_empty(&queue->timer_list.list)) cancel_delayed_work(&queue->timer_list.dwork); } static void rpc_set_queue_timer(struct rpc_wait_queue *queue, unsigned long expires) { unsigned long now = jiffies; queue->timer_list.expires = expires; if (time_before_eq(expires, now)) expires = 0; else expires -= now; mod_delayed_work(rpciod_workqueue, &queue->timer_list.dwork, expires); } /* * Set up a timer for the current task. */ static void __rpc_add_timer(struct rpc_wait_queue *queue, struct rpc_task *task, unsigned long timeout) { task->tk_timeout = timeout; if (list_empty(&queue->timer_list.list) || time_before(timeout, queue->timer_list.expires)) rpc_set_queue_timer(queue, timeout); list_add(&task->u.tk_wait.timer_list, &queue->timer_list.list); } static void rpc_set_waitqueue_priority(struct rpc_wait_queue *queue, int priority) { if (queue->priority != priority) { queue->priority = priority; queue->nr = 1U << priority; } } static void rpc_reset_waitqueue_priority(struct rpc_wait_queue *queue) { rpc_set_waitqueue_priority(queue, queue->maxpriority); } /* * Add a request to a queue list */ static void __rpc_list_enqueue_task(struct list_head *q, struct rpc_task *task) { struct rpc_task *t; list_for_each_entry(t, q, u.tk_wait.list) { if (t->tk_owner == task->tk_owner) { list_add_tail(&task->u.tk_wait.links, &t->u.tk_wait.links); /* Cache the queue head in task->u.tk_wait.list */ task->u.tk_wait.list.next = q; task->u.tk_wait.list.prev = NULL; return; } } INIT_LIST_HEAD(&task->u.tk_wait.links); list_add_tail(&task->u.tk_wait.list, q); } /* * Remove request from a queue list */ static void __rpc_list_dequeue_task(struct rpc_task *task) { struct list_head *q; struct rpc_task *t; if (task->u.tk_wait.list.prev == NULL) { list_del(&task->u.tk_wait.links); return; } if (!list_empty(&task->u.tk_wait.links)) { t = list_first_entry(&task->u.tk_wait.links, struct rpc_task, u.tk_wait.links); /* Assume __rpc_list_enqueue_task() cached the queue head */ q = t->u.tk_wait.list.next; list_add_tail(&t->u.tk_wait.list, q); list_del(&task->u.tk_wait.links); } list_del(&task->u.tk_wait.list); } /* * Add new request to a priority queue. */ static void __rpc_add_wait_queue_priority(struct rpc_wait_queue *queue, struct rpc_task *task, unsigned char queue_priority) { if (unlikely(queue_priority > queue->maxpriority)) queue_priority = queue->maxpriority; __rpc_list_enqueue_task(&queue->tasks[queue_priority], task); } /* * Add new request to wait queue. */ static void __rpc_add_wait_queue(struct rpc_wait_queue *queue, struct rpc_task *task, unsigned char queue_priority) { INIT_LIST_HEAD(&task->u.tk_wait.timer_list); if (RPC_IS_PRIORITY(queue)) __rpc_add_wait_queue_priority(queue, task, queue_priority); else list_add_tail(&task->u.tk_wait.list, &queue->tasks[0]); task->tk_waitqueue = queue; queue->qlen++; /* barrier matches the read in rpc_wake_up_task_queue_locked() */ smp_wmb(); rpc_set_queued(task); } /* * Remove request from a priority queue. */ static void __rpc_remove_wait_queue_priority(struct rpc_task *task) { __rpc_list_dequeue_task(task); } /* * Remove request from queue. * Note: must be called with spin lock held. */ static void __rpc_remove_wait_queue(struct rpc_wait_queue *queue, struct rpc_task *task) { __rpc_disable_timer(queue, task); if (RPC_IS_PRIORITY(queue)) __rpc_remove_wait_queue_priority(task); else list_del(&task->u.tk_wait.list); queue->qlen--; } static void __rpc_init_priority_wait_queue(struct rpc_wait_queue *queue, const char *qname, unsigned char nr_queues) { int i; spin_lock_init(&queue->lock); for (i = 0; i < ARRAY_SIZE(queue->tasks); i++) INIT_LIST_HEAD(&queue->tasks[i]); queue->maxpriority = nr_queues - 1; rpc_reset_waitqueue_priority(queue); queue->qlen = 0; queue->timer_list.expires = 0; INIT_DELAYED_WORK(&queue->timer_list.dwork, __rpc_queue_timer_fn); INIT_LIST_HEAD(&queue->timer_list.list); rpc_assign_waitqueue_name(queue, qname); } void rpc_init_priority_wait_queue(struct rpc_wait_queue *queue, const char *qname) { __rpc_init_priority_wait_queue(queue, qname, RPC_NR_PRIORITY); } EXPORT_SYMBOL_GPL(rpc_init_priority_wait_queue); void rpc_init_wait_queue(struct rpc_wait_queue *queue, const char *qname) { __rpc_init_priority_wait_queue(queue, qname, 1); } EXPORT_SYMBOL_GPL(rpc_init_wait_queue); void rpc_destroy_wait_queue(struct rpc_wait_queue *queue) { cancel_delayed_work_sync(&queue->timer_list.dwork); } EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue); static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode) { schedule(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; } #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS) static void rpc_task_set_debuginfo(struct rpc_task *task) { struct rpc_clnt *clnt = task->tk_client; /* Might be a task carrying a reverse-direction operation */ if (!clnt) { static atomic_t rpc_pid; task->tk_pid = atomic_inc_return(&rpc_pid); return; } task->tk_pid = atomic_inc_return(&clnt->cl_pid); } #else static inline void rpc_task_set_debuginfo(struct rpc_task *task) { } #endif static void rpc_set_active(struct rpc_task *task) { rpc_task_set_debuginfo(task); set_bit(RPC_TASK_ACTIVE, &task->tk_runstate); trace_rpc_task_begin(task, NULL); } /* * Mark an RPC call as having completed by clearing the 'active' bit * and then waking up all tasks that were sleeping. */ static int rpc_complete_task(struct rpc_task *task) { void *m = &task->tk_runstate; wait_queue_head_t *wq = bit_waitqueue(m, RPC_TASK_ACTIVE); struct wait_bit_key k = __WAIT_BIT_KEY_INITIALIZER(m, RPC_TASK_ACTIVE); unsigned long flags; int ret; trace_rpc_task_complete(task, NULL); spin_lock_irqsave(&wq->lock, flags); clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate); ret = atomic_dec_and_test(&task->tk_count); if (waitqueue_active(wq)) __wake_up_locked_key(wq, TASK_NORMAL, &k); spin_unlock_irqrestore(&wq->lock, flags); return ret; } /* * Allow callers to wait for completion of an RPC call * * Note the use of out_of_line_wait_on_bit() rather than wait_on_bit() * to enforce taking of the wq->lock and hence avoid races with * rpc_complete_task(). */ int rpc_wait_for_completion_task(struct rpc_task *task) { return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE, rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); } EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task); /* * Make an RPC task runnable. * * Note: If the task is ASYNC, and is being made runnable after sitting on an * rpc_wait_queue, this must be called with the queue spinlock held to protect * the wait queue operation. * Note the ordering of rpc_test_and_set_running() and rpc_clear_queued(), * which is needed to ensure that __rpc_execute() doesn't loop (due to the * lockless RPC_IS_QUEUED() test) before we've had a chance to test * the RPC_TASK_RUNNING flag. */ static void rpc_make_runnable(struct workqueue_struct *wq, struct rpc_task *task) { bool need_wakeup = !rpc_test_and_set_running(task); rpc_clear_queued(task); if (!need_wakeup) return; if (RPC_IS_ASYNC(task)) { INIT_WORK(&task->u.tk_work, rpc_async_schedule); queue_work(wq, &task->u.tk_work); } else { smp_mb__after_atomic(); wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED); } } /* * Prepare for sleeping on a wait queue. * By always appending tasks to the list we ensure FIFO behavior. * NB: An RPC task will only receive interrupt-driven events as long * as it's on a wait queue. */ static void __rpc_do_sleep_on_priority(struct rpc_wait_queue *q, struct rpc_task *task, unsigned char queue_priority) { trace_rpc_task_sleep(task, q); __rpc_add_wait_queue(q, task, queue_priority); } static void __rpc_sleep_on_priority(struct rpc_wait_queue *q, struct rpc_task *task, unsigned char queue_priority) { if (WARN_ON_ONCE(RPC_IS_QUEUED(task))) return; __rpc_do_sleep_on_priority(q, task, queue_priority); } static void __rpc_sleep_on_priority_timeout(struct rpc_wait_queue *q, struct rpc_task *task, unsigned long timeout, unsigned char queue_priority) { if (WARN_ON_ONCE(RPC_IS_QUEUED(task))) return; if (time_is_after_jiffies(timeout)) { __rpc_do_sleep_on_priority(q, task, queue_priority); __rpc_add_timer(q, task, timeout); } else task->tk_status = -ETIMEDOUT; } static void rpc_set_tk_callback(struct rpc_task *task, rpc_action action) { if (action && !WARN_ON_ONCE(task->tk_callback != NULL)) task->tk_callback = action; } static bool rpc_sleep_check_activated(struct rpc_task *task) { /* We shouldn't ever put an inactive task to sleep */ if (WARN_ON_ONCE(!RPC_IS_ACTIVATED(task))) { task->tk_status = -EIO; rpc_put_task_async(task); return false; } return true; } void rpc_sleep_on_timeout(struct rpc_wait_queue *q, struct rpc_task *task, rpc_action action, unsigned long timeout) { if (!rpc_sleep_check_activated(task)) return; rpc_set_tk_callback(task, action); /* * Protect the queue operations. */ spin_lock(&q->lock); __rpc_sleep_on_priority_timeout(q, task, timeout, task->tk_priority); spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on_timeout); void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task, rpc_action action) { if (!rpc_sleep_check_activated(task)) return; rpc_set_tk_callback(task, action); WARN_ON_ONCE(task->tk_timeout != 0); /* * Protect the queue operations. */ spin_lock(&q->lock); __rpc_sleep_on_priority(q, task, task->tk_priority); spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on); void rpc_sleep_on_priority_timeout(struct rpc_wait_queue *q, struct rpc_task *task, unsigned long timeout, int priority) { if (!rpc_sleep_check_activated(task)) return; priority -= RPC_PRIORITY_LOW; /* * Protect the queue operations. */ spin_lock(&q->lock); __rpc_sleep_on_priority_timeout(q, task, timeout, priority); spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on_priority_timeout); void rpc_sleep_on_priority(struct rpc_wait_queue *q, struct rpc_task *task, int priority) { if (!rpc_sleep_check_activated(task)) return; WARN_ON_ONCE(task->tk_timeout != 0); priority -= RPC_PRIORITY_LOW; /* * Protect the queue operations. */ spin_lock(&q->lock); __rpc_sleep_on_priority(q, task, priority); spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on_priority); /** * __rpc_do_wake_up_task_on_wq - wake up a single rpc_task * @wq: workqueue on which to run task * @queue: wait queue * @task: task to be woken up * * Caller must hold queue->lock, and have cleared the task queued flag. */ static void __rpc_do_wake_up_task_on_wq(struct workqueue_struct *wq, struct rpc_wait_queue *queue, struct rpc_task *task) { /* Has the task been executed yet? If not, we cannot wake it up! */ if (!RPC_IS_ACTIVATED(task)) { printk(KERN_ERR "RPC: Inactive task (%p) being woken up!\n", task); return; } trace_rpc_task_wakeup(task, queue); __rpc_remove_wait_queue(queue, task); rpc_make_runnable(wq, task); } /* * Wake up a queued task while the queue lock is being held */ static struct rpc_task * rpc_wake_up_task_on_wq_queue_action_locked(struct workqueue_struct *wq, struct rpc_wait_queue *queue, struct rpc_task *task, bool (*action)(struct rpc_task *, void *), void *data) { if (RPC_IS_QUEUED(task)) { smp_rmb(); if (task->tk_waitqueue == queue) { if (action == NULL || action(task, data)) { __rpc_do_wake_up_task_on_wq(wq, queue, task); return task; } } } return NULL; } /* * Wake up a queued task while the queue lock is being held */ static void rpc_wake_up_task_queue_locked(struct rpc_wait_queue *queue, struct rpc_task *task) { rpc_wake_up_task_on_wq_queue_action_locked(rpciod_workqueue, queue, task, NULL, NULL); } /* * Wake up a task on a specific queue */ void rpc_wake_up_queued_task(struct rpc_wait_queue *queue, struct rpc_task *task) { if (!RPC_IS_QUEUED(task)) return; spin_lock(&queue->lock); rpc_wake_up_task_queue_locked(queue, task); spin_unlock(&queue->lock); } EXPORT_SYMBOL_GPL(rpc_wake_up_queued_task); static bool rpc_task_action_set_status(struct rpc_task *task, void *status) { task->tk_status = *(int *)status; return true; } static void rpc_wake_up_task_queue_set_status_locked(struct rpc_wait_queue *queue, struct rpc_task *task, int status) { rpc_wake_up_task_on_wq_queue_action_locked(rpciod_workqueue, queue, task, rpc_task_action_set_status, &status); } /** * rpc_wake_up_queued_task_set_status - wake up a task and set task->tk_status * @queue: pointer to rpc_wait_queue * @task: pointer to rpc_task * @status: integer error value * * If @task is queued on @queue, then it is woken up, and @task->tk_status is * set to the value of @status. */ void rpc_wake_up_queued_task_set_status(struct rpc_wait_queue *queue, struct rpc_task *task, int status) { if (!RPC_IS_QUEUED(task)) return; spin_lock(&queue->lock); rpc_wake_up_task_queue_set_status_locked(queue, task, status); spin_unlock(&queue->lock); } /* * Wake up the next task on a priority queue. */ static struct rpc_task *__rpc_find_next_queued_priority(struct rpc_wait_queue *queue) { struct list_head *q; struct rpc_task *task; /* * Service the privileged queue. */ q = &queue->tasks[RPC_NR_PRIORITY - 1]; if (queue->maxpriority > RPC_PRIORITY_PRIVILEGED && !list_empty(q)) { task = list_first_entry(q, struct rpc_task, u.tk_wait.list); goto out; } /* * Service a batch of tasks from a single owner. */ q = &queue->tasks[queue->priority]; if (!list_empty(q) && queue->nr) { queue->nr--; task = list_first_entry(q, struct rpc_task, u.tk_wait.list); goto out; } /* * Service the next queue. */ do { if (q == &queue->tasks[0]) q = &queue->tasks[queue->maxpriority]; else q = q - 1; if (!list_empty(q)) { task = list_first_entry(q, struct rpc_task, u.tk_wait.list); goto new_queue; } } while (q != &queue->tasks[queue->priority]); rpc_reset_waitqueue_priority(queue); return NULL; new_queue: rpc_set_waitqueue_priority(queue, (unsigned int)(q - &queue->tasks[0])); out: return task; } static struct rpc_task *__rpc_find_next_queued(struct rpc_wait_queue *queue) { if (RPC_IS_PRIORITY(queue)) return __rpc_find_next_queued_priority(queue); if (!list_empty(&queue->tasks[0])) return list_first_entry(&queue->tasks[0], struct rpc_task, u.tk_wait.list); return NULL; } /* * Wake up the first task on the wait queue. */ struct rpc_task *rpc_wake_up_first_on_wq(struct workqueue_struct *wq, struct rpc_wait_queue *queue, bool (*func)(struct rpc_task *, void *), void *data) { struct rpc_task *task = NULL; spin_lock(&queue->lock); task = __rpc_find_next_queued(queue); if (task != NULL) task = rpc_wake_up_task_on_wq_queue_action_locked(wq, queue, task, func, data); spin_unlock(&queue->lock); return task; } /* * Wake up the first task on the wait queue. */ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *queue, bool (*func)(struct rpc_task *, void *), void *data) { return rpc_wake_up_first_on_wq(rpciod_workqueue, queue, func, data); } EXPORT_SYMBOL_GPL(rpc_wake_up_first); static bool rpc_wake_up_next_func(struct rpc_task *task, void *data) { return true; } /* * Wake up the next task on the wait queue. */ struct rpc_task *rpc_wake_up_next(struct rpc_wait_queue *queue) { return rpc_wake_up_first(queue, rpc_wake_up_next_func, NULL); } EXPORT_SYMBOL_GPL(rpc_wake_up_next); /** * rpc_wake_up_locked - wake up all rpc_tasks * @queue: rpc_wait_queue on which the tasks are sleeping * */ static void rpc_wake_up_locked(struct rpc_wait_queue *queue) { struct rpc_task *task; for (;;) { task = __rpc_find_next_queued(queue); if (task == NULL) break; rpc_wake_up_task_queue_locked(queue, task); } } /** * rpc_wake_up - wake up all rpc_tasks * @queue: rpc_wait_queue on which the tasks are sleeping * * Grabs queue->lock */ void rpc_wake_up(struct rpc_wait_queue *queue) { spin_lock(&queue->lock); rpc_wake_up_locked(queue); spin_unlock(&queue->lock); } EXPORT_SYMBOL_GPL(rpc_wake_up); /** * rpc_wake_up_status_locked - wake up all rpc_tasks and set their status value. * @queue: rpc_wait_queue on which the tasks are sleeping * @status: status value to set */ static void rpc_wake_up_status_locked(struct rpc_wait_queue *queue, int status) { struct rpc_task *task; for (;;) { task = __rpc_find_next_queued(queue); if (task == NULL) break; rpc_wake_up_task_queue_set_status_locked(queue, task, status); } } /** * rpc_wake_up_status - wake up all rpc_tasks and set their status value. * @queue: rpc_wait_queue on which the tasks are sleeping * @status: status value to set * * Grabs queue->lock */ void rpc_wake_up_status(struct rpc_wait_queue *queue, int status) { spin_lock(&queue->lock); rpc_wake_up_status_locked(queue, status); spin_unlock(&queue->lock); } EXPORT_SYMBOL_GPL(rpc_wake_up_status); static void __rpc_queue_timer_fn(struct work_struct *work) { struct rpc_wait_queue *queue = container_of(work, struct rpc_wait_queue, timer_list.dwork.work); struct rpc_task *task, *n; unsigned long expires, now, timeo; spin_lock(&queue->lock); expires = now = jiffies; list_for_each_entry_safe(task, n, &queue->timer_list.list, u.tk_wait.timer_list) { timeo = task->tk_timeout; if (time_after_eq(now, timeo)) { trace_rpc_task_timeout(task, task->tk_action); task->tk_status = -ETIMEDOUT; rpc_wake_up_task_queue_locked(queue, task); continue; } if (expires == now || time_after(expires, timeo)) expires = timeo; } if (!list_empty(&queue->timer_list.list)) rpc_set_queue_timer(queue, expires); spin_unlock(&queue->lock); } static void __rpc_atrun(struct rpc_task *task) { if (task->tk_status == -ETIMEDOUT) task->tk_status = 0; } /* * Run a task at a later time */ void rpc_delay(struct rpc_task *task, unsigned long delay) { rpc_sleep_on_timeout(&delay_queue, task, __rpc_atrun, jiffies + delay); } EXPORT_SYMBOL_GPL(rpc_delay); /* * Helper to call task->tk_ops->rpc_call_prepare */ void rpc_prepare_task(struct rpc_task *task) { task->tk_ops->rpc_call_prepare(task, task->tk_calldata); } static void rpc_init_task_statistics(struct rpc_task *task) { /* Initialize retry counters */ task->tk_garb_retry = 2; task->tk_cred_retry = 2; /* starting timestamp */ task->tk_start = ktime_get(); } static void rpc_reset_task_statistics(struct rpc_task *task) { task->tk_timeouts = 0; task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT); rpc_init_task_statistics(task); } /* * Helper that calls task->tk_ops->rpc_call_done if it exists */ void rpc_exit_task(struct rpc_task *task) { trace_rpc_task_end(task, task->tk_action); task->tk_action = NULL; if (task->tk_ops->rpc_count_stats) task->tk_ops->rpc_count_stats(task, task->tk_calldata); else if (task->tk_client) rpc_count_iostats(task, task->tk_client->cl_metrics); if (task->tk_ops->rpc_call_done != NULL) { trace_rpc_task_call_done(task, task->tk_ops->rpc_call_done); task->tk_ops->rpc_call_done(task, task->tk_calldata); if (task->tk_action != NULL) { /* Always release the RPC slot and buffer memory */ xprt_release(task); rpc_reset_task_statistics(task); } } } void rpc_signal_task(struct rpc_task *task) { struct rpc_wait_queue *queue; if (!RPC_IS_ACTIVATED(task)) return; if (!rpc_task_set_rpc_status(task, -ERESTARTSYS)) return; trace_rpc_task_signalled(task, task->tk_action); queue = READ_ONCE(task->tk_waitqueue); if (queue) rpc_wake_up_queued_task(queue, task); } void rpc_task_try_cancel(struct rpc_task *task, int error) { struct rpc_wait_queue *queue; if (!rpc_task_set_rpc_status(task, error)) return; queue = READ_ONCE(task->tk_waitqueue); if (queue) rpc_wake_up_queued_task(queue, task); } void rpc_exit(struct rpc_task *task, int status) { task->tk_status = status; task->tk_action = rpc_exit_task; rpc_wake_up_queued_task(task->tk_waitqueue, task); } EXPORT_SYMBOL_GPL(rpc_exit); void rpc_release_calldata(const struct rpc_call_ops *ops, void *calldata) { if (ops->rpc_release != NULL) ops->rpc_release(calldata); } static bool xprt_needs_memalloc(struct rpc_xprt *xprt, struct rpc_task *tk) { if (!xprt) return false; if (!atomic_read(&xprt->swapper)) return false; return test_bit(XPRT_LOCKED, &xprt->state) && xprt->snd_task == tk; } /* * This is the RPC `scheduler' (or rather, the finite state machine). */ static void __rpc_execute(struct rpc_task *task) { struct rpc_wait_queue *queue; int task_is_async = RPC_IS_ASYNC(task); int status = 0; unsigned long pflags = current->flags; WARN_ON_ONCE(RPC_IS_QUEUED(task)); if (RPC_IS_QUEUED(task)) return; for (;;) { void (*do_action)(struct rpc_task *); /* * Perform the next FSM step or a pending callback. * * tk_action may be NULL if the task has been killed. */ do_action = task->tk_action; /* Tasks with an RPC error status should exit */ if (do_action && do_action != rpc_exit_task && (status = READ_ONCE(task->tk_rpc_status)) != 0) { task->tk_status = status; do_action = rpc_exit_task; } /* Callbacks override all actions */ if (task->tk_callback) { do_action = task->tk_callback; task->tk_callback = NULL; } if (!do_action) break; if (RPC_IS_SWAPPER(task) || xprt_needs_memalloc(task->tk_xprt, task)) current->flags |= PF_MEMALLOC; trace_rpc_task_run_action(task, do_action); do_action(task); /* * Lockless check for whether task is sleeping or not. */ if (!RPC_IS_QUEUED(task)) { cond_resched(); continue; } /* * The queue->lock protects against races with * rpc_make_runnable(). * * Note that once we clear RPC_TASK_RUNNING on an asynchronous * rpc_task, rpc_make_runnable() can assign it to a * different workqueue. We therefore cannot assume that the * rpc_task pointer may still be dereferenced. */ queue = task->tk_waitqueue; spin_lock(&queue->lock); if (!RPC_IS_QUEUED(task)) { spin_unlock(&queue->lock); continue; } /* Wake up any task that has an exit status */ if (READ_ONCE(task->tk_rpc_status) != 0) { rpc_wake_up_task_queue_locked(queue, task); spin_unlock(&queue->lock); continue; } rpc_clear_running(task); spin_unlock(&queue->lock); if (task_is_async) goto out; /* sync task: sleep here */ trace_rpc_task_sync_sleep(task, task->tk_action); status = out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_QUEUED, rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE); if (status < 0) { /* * When a sync task receives a signal, it exits with * -ERESTARTSYS. In order to catch any callbacks that * clean up after sleeping on some queue, we don't * break the loop here, but go around once more. */ rpc_signal_task(task); } trace_rpc_task_sync_wake(task, task->tk_action); } /* Release all resources associated with the task */ rpc_release_task(task); out: current_restore_flags(pflags, PF_MEMALLOC); } /* * User-visible entry point to the scheduler. * * This may be called recursively if e.g. an async NFS task updates * the attributes and finds that dirty pages must be flushed. * NOTE: Upon exit of this function the task is guaranteed to be * released. In particular note that tk_release() will have * been called, so your task memory may have been freed. */ void rpc_execute(struct rpc_task *task) { bool is_async = RPC_IS_ASYNC(task); rpc_set_active(task); rpc_make_runnable(rpciod_workqueue, task); if (!is_async) { unsigned int pflags = memalloc_nofs_save(); __rpc_execute(task); memalloc_nofs_restore(pflags); } } static void rpc_async_schedule(struct work_struct *work) { unsigned int pflags = memalloc_nofs_save(); __rpc_execute(container_of(work, struct rpc_task, u.tk_work)); memalloc_nofs_restore(pflags); } /** * rpc_malloc - allocate RPC buffer resources * @task: RPC task * * A single memory region is allocated, which is split between the * RPC call and RPC reply that this task is being used for. When * this RPC is retired, the memory is released by calling rpc_free. * * To prevent rpciod from hanging, this allocator never sleeps, * returning -ENOMEM and suppressing warning if the request cannot * be serviced immediately. The caller can arrange to sleep in a * way that is safe for rpciod. * * Most requests are 'small' (under 2KiB) and can be serviced from a * mempool, ensuring that NFS reads and writes can always proceed, * and that there is good locality of reference for these buffers. */ int rpc_malloc(struct rpc_task *task) { struct rpc_rqst *rqst = task->tk_rqstp; size_t size = rqst->rq_callsize + rqst->rq_rcvsize; struct rpc_buffer *buf; gfp_t gfp = rpc_task_gfp_mask(); size += sizeof(struct rpc_buffer); if (size <= RPC_BUFFER_MAXSIZE) { buf = kmem_cache_alloc(rpc_buffer_slabp, gfp); /* Reach for the mempool if dynamic allocation fails */ if (!buf && RPC_IS_ASYNC(task)) buf = mempool_alloc(rpc_buffer_mempool, GFP_NOWAIT); } else buf = kmalloc(size, gfp); if (!buf) return -ENOMEM; buf->len = size; rqst->rq_buffer = buf->data; rqst->rq_rbuffer = (char *)rqst->rq_buffer + rqst->rq_callsize; return 0; } /** * rpc_free - free RPC buffer resources allocated via rpc_malloc * @task: RPC task * */ void rpc_free(struct rpc_task *task) { void *buffer = task->tk_rqstp->rq_buffer; size_t size; struct rpc_buffer *buf; buf = container_of(buffer, struct rpc_buffer, data); size = buf->len; if (size <= RPC_BUFFER_MAXSIZE) mempool_free(buf, rpc_buffer_mempool); else kfree(buf); } /* * Creation and deletion of RPC task structures */ static void rpc_init_task(struct rpc_task *task, const struct rpc_task_setup *task_setup_data) { memset(task, 0, sizeof(*task)); atomic_set(&task->tk_count, 1); task->tk_flags = task_setup_data->flags; task->tk_ops = task_setup_data->callback_ops; task->tk_calldata = task_setup_data->callback_data; INIT_LIST_HEAD(&task->tk_task); task->tk_priority = task_setup_data->priority - RPC_PRIORITY_LOW; task->tk_owner = current->tgid; /* Initialize workqueue for async tasks */ task->tk_workqueue = task_setup_data->workqueue; task->tk_xprt = rpc_task_get_xprt(task_setup_data->rpc_client, xprt_get(task_setup_data->rpc_xprt)); task->tk_op_cred = get_rpccred(task_setup_data->rpc_op_cred); if (task->tk_ops->rpc_call_prepare != NULL) task->tk_action = rpc_prepare_task; rpc_init_task_statistics(task); } static struct rpc_task *rpc_alloc_task(void) { struct rpc_task *task; task = kmem_cache_alloc(rpc_task_slabp, rpc_task_gfp_mask()); if (task) return task; return mempool_alloc(rpc_task_mempool, GFP_NOWAIT); } /* * Create a new task for the specified client. */ struct rpc_task *rpc_new_task(const struct rpc_task_setup *setup_data) { struct rpc_task *task = setup_data->task; unsigned short flags = 0; if (task == NULL) { task = rpc_alloc_task(); if (task == NULL) { rpc_release_calldata(setup_data->callback_ops, setup_data->callback_data); return ERR_PTR(-ENOMEM); } flags = RPC_TASK_DYNAMIC; } rpc_init_task(task, setup_data); task->tk_flags |= flags; return task; } /* * rpc_free_task - release rpc task and perform cleanups * * Note that we free up the rpc_task _after_ rpc_release_calldata() * in order to work around a workqueue dependency issue. * * Tejun Heo states: * "Workqueue currently considers two work items to be the same if they're * on the same address and won't execute them concurrently - ie. it * makes a work item which is queued again while being executed wait * for the previous execution to complete. * * If a work function frees the work item, and then waits for an event * which should be performed by another work item and *that* work item * recycles the freed work item, it can create a false dependency loop. * There really is no reliable way to detect this short of verifying * every memory free." * */ static void rpc_free_task(struct rpc_task *task) { unsigned short tk_flags = task->tk_flags; put_rpccred(task->tk_op_cred); rpc_release_calldata(task->tk_ops, task->tk_calldata); if (tk_flags & RPC_TASK_DYNAMIC) mempool_free(task, rpc_task_mempool); } static void rpc_async_release(struct work_struct *work) { unsigned int pflags = memalloc_nofs_save(); rpc_free_task(container_of(work, struct rpc_task, u.tk_work)); memalloc_nofs_restore(pflags); } static void rpc_release_resources_task(struct rpc_task *task) { xprt_release(task); if (task->tk_msg.rpc_cred) { if (!(task->tk_flags & RPC_TASK_CRED_NOREF)) put_cred(task->tk_msg.rpc_cred); task->tk_msg.rpc_cred = NULL; } rpc_task_release_client(task); } static void rpc_final_put_task(struct rpc_task *task, struct workqueue_struct *q) { if (q != NULL) { INIT_WORK(&task->u.tk_work, rpc_async_release); queue_work(q, &task->u.tk_work); } else rpc_free_task(task); } static void rpc_do_put_task(struct rpc_task *task, struct workqueue_struct *q) { if (atomic_dec_and_test(&task->tk_count)) { rpc_release_resources_task(task); rpc_final_put_task(task, q); } } void rpc_put_task(struct rpc_task *task) { rpc_do_put_task(task, NULL); } EXPORT_SYMBOL_GPL(rpc_put_task); void rpc_put_task_async(struct rpc_task *task) { rpc_do_put_task(task, task->tk_workqueue); } EXPORT_SYMBOL_GPL(rpc_put_task_async); static void rpc_release_task(struct rpc_task *task) { WARN_ON_ONCE(RPC_IS_QUEUED(task)); rpc_release_resources_task(task); /* * Note: at this point we have been removed from rpc_clnt->cl_tasks, * so it should be safe to use task->tk_count as a test for whether * or not any other processes still hold references to our rpc_task. */ if (atomic_read(&task->tk_count) != 1 + !RPC_IS_ASYNC(task)) { /* Wake up anyone who may be waiting for task completion */ if (!rpc_complete_task(task)) return; } else { if (!atomic_dec_and_test(&task->tk_count)) return; } rpc_final_put_task(task, task->tk_workqueue); } int rpciod_up(void) { return try_module_get(THIS_MODULE) ? 0 : -EINVAL; } void rpciod_down(void) { module_put(THIS_MODULE); } /* * Start up the rpciod workqueue. */ static int rpciod_start(void) { struct workqueue_struct *wq; /* * Create the rpciod thread and wait for it to start. */ wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0); if (!wq) goto out_failed; rpciod_workqueue = wq; wq = alloc_workqueue("xprtiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0); if (!wq) goto free_rpciod; xprtiod_workqueue = wq; return 1; free_rpciod: wq = rpciod_workqueue; rpciod_workqueue = NULL; destroy_workqueue(wq); out_failed: return 0; } static void rpciod_stop(void) { struct workqueue_struct *wq = NULL; if (rpciod_workqueue == NULL) return; wq = rpciod_workqueue; rpciod_workqueue = NULL; destroy_workqueue(wq); wq = xprtiod_workqueue; xprtiod_workqueue = NULL; destroy_workqueue(wq); } void rpc_destroy_mempool(void) { rpciod_stop(); mempool_destroy(rpc_buffer_mempool); mempool_destroy(rpc_task_mempool); kmem_cache_destroy(rpc_task_slabp); kmem_cache_destroy(rpc_buffer_slabp); rpc_destroy_wait_queue(&delay_queue); } int rpc_init_mempool(void) { /* * The following is not strictly a mempool initialisation, * but there is no harm in doing it here */ rpc_init_wait_queue(&delay_queue, "delayq"); if (!rpciod_start()) goto err_nomem; rpc_task_slabp = kmem_cache_create("rpc_tasks", sizeof(struct rpc_task), 0, SLAB_HWCACHE_ALIGN, NULL); if (!rpc_task_slabp) goto err_nomem; rpc_buffer_slabp = kmem_cache_create("rpc_buffers", RPC_BUFFER_MAXSIZE, 0, SLAB_HWCACHE_ALIGN, NULL); if (!rpc_buffer_slabp) goto err_nomem; rpc_task_mempool = mempool_create_slab_pool(RPC_TASK_POOLSIZE, rpc_task_slabp); if (!rpc_task_mempool) goto err_nomem; rpc_buffer_mempool = mempool_create_slab_pool(RPC_BUFFER_POOLSIZE, rpc_buffer_slabp); if (!rpc_buffer_mempool) goto err_nomem; return 0; err_nomem: rpc_destroy_mempool(); return -ENOMEM; } |
| 2 2 2 1 2 5 5 3 3 3 1 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 12 12 12 12 12 12 12 2 12 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 6 9 9 9 9 9 9 9 9 3 3 3 6 6 6 6 6 6 9 10 10 10 8 10 2 2 4 3 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 | // SPDX-License-Identifier: GPL-2.0-only #include <linux/net_tstamp.h> #include <linux/phy.h> #include <linux/phy_link_topology.h> #include <linux/ptp_clock_kernel.h> #include <net/netdev_lock.h> #include "netlink.h" #include "common.h" #include "bitset.h" #include "ts.h" struct tsinfo_req_info { struct ethnl_req_info base; struct hwtstamp_provider_desc hwprov_desc; }; struct tsinfo_reply_data { struct ethnl_reply_data base; struct kernel_ethtool_ts_info ts_info; struct ethtool_ts_stats stats; }; #define TSINFO_REQINFO(__req_base) \ container_of(__req_base, struct tsinfo_req_info, base) #define TSINFO_REPDATA(__reply_base) \ container_of(__reply_base, struct tsinfo_reply_data, base) #define ETHTOOL_TS_STAT_CNT \ (__ETHTOOL_A_TS_STAT_CNT - (ETHTOOL_A_TS_STAT_UNSPEC + 1)) const struct nla_policy ethnl_tsinfo_get_policy[ETHTOOL_A_TSINFO_MAX + 1] = { [ETHTOOL_A_TSINFO_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy_stats), [ETHTOOL_A_TSINFO_HWTSTAMP_PROVIDER] = NLA_POLICY_NESTED(ethnl_ts_hwtst_prov_policy), }; int ts_parse_hwtst_provider(const struct nlattr *nest, struct hwtstamp_provider_desc *hwprov_desc, struct netlink_ext_ack *extack, bool *mod) { struct nlattr *tb[ARRAY_SIZE(ethnl_ts_hwtst_prov_policy)]; int ret; ret = nla_parse_nested(tb, ARRAY_SIZE(ethnl_ts_hwtst_prov_policy) - 1, nest, ethnl_ts_hwtst_prov_policy, extack); if (ret < 0) return ret; if (NL_REQ_ATTR_CHECK(extack, nest, tb, ETHTOOL_A_TS_HWTSTAMP_PROVIDER_INDEX) || NL_REQ_ATTR_CHECK(extack, nest, tb, ETHTOOL_A_TS_HWTSTAMP_PROVIDER_QUALIFIER)) return -EINVAL; ethnl_update_u32(&hwprov_desc->index, tb[ETHTOOL_A_TS_HWTSTAMP_PROVIDER_INDEX], mod); ethnl_update_u32(&hwprov_desc->qualifier, tb[ETHTOOL_A_TS_HWTSTAMP_PROVIDER_QUALIFIER], mod); return 0; } static int tsinfo_parse_request(struct ethnl_req_info *req_base, struct nlattr **tb, struct netlink_ext_ack *extack) { struct tsinfo_req_info *req = TSINFO_REQINFO(req_base); bool mod = false; req->hwprov_desc.index = -1; if (!tb[ETHTOOL_A_TSINFO_HWTSTAMP_PROVIDER]) return 0; return ts_parse_hwtst_provider(tb[ETHTOOL_A_TSINFO_HWTSTAMP_PROVIDER], &req->hwprov_desc, extack, &mod); } static int tsinfo_prepare_data(const struct ethnl_req_info *req_base, struct ethnl_reply_data *reply_base, const struct genl_info *info) { struct tsinfo_reply_data *data = TSINFO_REPDATA(reply_base); struct tsinfo_req_info *req = TSINFO_REQINFO(req_base); struct net_device *dev = reply_base->dev; int ret; ret = ethnl_ops_begin(dev); if (ret < 0) return ret; if (req->hwprov_desc.index != -1) { ret = ethtool_get_ts_info_by_phc(dev, &data->ts_info, &req->hwprov_desc); ethnl_ops_complete(dev); return ret; } if (req_base->flags & ETHTOOL_FLAG_STATS) { ethtool_stats_init((u64 *)&data->stats, sizeof(data->stats) / sizeof(u64)); if (dev->ethtool_ops->get_ts_stats) dev->ethtool_ops->get_ts_stats(dev, &data->stats); } ret = __ethtool_get_ts_info(dev, &data->ts_info); ethnl_ops_complete(dev); return ret; } static int tsinfo_reply_size(const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { const struct tsinfo_reply_data *data = TSINFO_REPDATA(reply_base); bool compact = req_base->flags & ETHTOOL_FLAG_COMPACT_BITSETS; const struct kernel_ethtool_ts_info *ts_info = &data->ts_info; int len = 0; int ret; BUILD_BUG_ON(__SOF_TIMESTAMPING_CNT > 32); BUILD_BUG_ON(__HWTSTAMP_TX_CNT > 32); BUILD_BUG_ON(__HWTSTAMP_FILTER_CNT > 32); if (ts_info->so_timestamping) { ret = ethnl_bitset32_size(&ts_info->so_timestamping, NULL, __SOF_TIMESTAMPING_CNT, sof_timestamping_names, compact); if (ret < 0) return ret; len += ret; /* _TSINFO_TIMESTAMPING */ } if (ts_info->tx_types) { ret = ethnl_bitset32_size(&ts_info->tx_types, NULL, __HWTSTAMP_TX_CNT, ts_tx_type_names, compact); if (ret < 0) return ret; len += ret; /* _TSINFO_TX_TYPES */ } if (ts_info->rx_filters) { ret = ethnl_bitset32_size(&ts_info->rx_filters, NULL, __HWTSTAMP_FILTER_CNT, ts_rx_filter_names, compact); if (ret < 0) return ret; len += ret; /* _TSINFO_RX_FILTERS */ } if (ts_info->phc_index >= 0) { len += nla_total_size(sizeof(u32)); /* _TSINFO_PHC_INDEX */ /* _TSINFO_HWTSTAMP_PROVIDER */ len += nla_total_size(0) + 2 * nla_total_size(sizeof(u32)); } if (ts_info->phc_source) { len += nla_total_size(sizeof(u32)); /* _TSINFO_HWTSTAMP_SOURCE */ if (ts_info->phc_phyindex) /* _TSINFO_HWTSTAMP_PHYINDEX */ len += nla_total_size(sizeof(u32)); } if (req_base->flags & ETHTOOL_FLAG_STATS) len += nla_total_size(0) + /* _TSINFO_STATS */ nla_total_size_64bit(sizeof(u64)) * ETHTOOL_TS_STAT_CNT; return len; } static int tsinfo_put_stat(struct sk_buff *skb, u64 val, u16 attrtype) { if (val == ETHTOOL_STAT_NOT_SET) return 0; if (nla_put_uint(skb, attrtype, val)) return -EMSGSIZE; return 0; } static int tsinfo_put_stats(struct sk_buff *skb, const struct ethtool_ts_stats *stats) { struct nlattr *nest; nest = nla_nest_start(skb, ETHTOOL_A_TSINFO_STATS); if (!nest) return -EMSGSIZE; if (tsinfo_put_stat(skb, stats->tx_stats.pkts, ETHTOOL_A_TS_STAT_TX_PKTS) || tsinfo_put_stat(skb, stats->tx_stats.onestep_pkts_unconfirmed, ETHTOOL_A_TS_STAT_TX_ONESTEP_PKTS_UNCONFIRMED) || tsinfo_put_stat(skb, stats->tx_stats.lost, ETHTOOL_A_TS_STAT_TX_LOST) || tsinfo_put_stat(skb, stats->tx_stats.err, ETHTOOL_A_TS_STAT_TX_ERR)) goto err_cancel; nla_nest_end(skb, nest); return 0; err_cancel: nla_nest_cancel(skb, nest); return -EMSGSIZE; } static int tsinfo_fill_reply(struct sk_buff *skb, const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { const struct tsinfo_reply_data *data = TSINFO_REPDATA(reply_base); bool compact = req_base->flags & ETHTOOL_FLAG_COMPACT_BITSETS; const struct kernel_ethtool_ts_info *ts_info = &data->ts_info; int ret; if (ts_info->so_timestamping) { ret = ethnl_put_bitset32(skb, ETHTOOL_A_TSINFO_TIMESTAMPING, &ts_info->so_timestamping, NULL, __SOF_TIMESTAMPING_CNT, sof_timestamping_names, compact); if (ret < 0) return ret; } if (ts_info->tx_types) { ret = ethnl_put_bitset32(skb, ETHTOOL_A_TSINFO_TX_TYPES, &ts_info->tx_types, NULL, __HWTSTAMP_TX_CNT, ts_tx_type_names, compact); if (ret < 0) return ret; } if (ts_info->rx_filters) { ret = ethnl_put_bitset32(skb, ETHTOOL_A_TSINFO_RX_FILTERS, &ts_info->rx_filters, NULL, __HWTSTAMP_FILTER_CNT, ts_rx_filter_names, compact); if (ret < 0) return ret; } if (ts_info->phc_index >= 0) { struct nlattr *nest; ret = nla_put_u32(skb, ETHTOOL_A_TSINFO_PHC_INDEX, ts_info->phc_index); if (ret) return -EMSGSIZE; nest = nla_nest_start(skb, ETHTOOL_A_TSINFO_HWTSTAMP_PROVIDER); if (!nest) return -EMSGSIZE; if (nla_put_u32(skb, ETHTOOL_A_TS_HWTSTAMP_PROVIDER_INDEX, ts_info->phc_index) || nla_put_u32(skb, ETHTOOL_A_TS_HWTSTAMP_PROVIDER_QUALIFIER, ts_info->phc_qualifier)) { nla_nest_cancel(skb, nest); return -EMSGSIZE; } nla_nest_end(skb, nest); } if (ts_info->phc_source) { if (nla_put_u32(skb, ETHTOOL_A_TSINFO_HWTSTAMP_SOURCE, ts_info->phc_source)) return -EMSGSIZE; if (ts_info->phc_phyindex && nla_put_u32(skb, ETHTOOL_A_TSINFO_HWTSTAMP_PHYINDEX, ts_info->phc_phyindex)) return -EMSGSIZE; } if (req_base->flags & ETHTOOL_FLAG_STATS && tsinfo_put_stats(skb, &data->stats)) return -EMSGSIZE; return 0; } struct ethnl_tsinfo_dump_ctx { struct tsinfo_req_info *req_info; struct tsinfo_reply_data *reply_data; unsigned long pos_ifindex; bool netdev_dump_done; unsigned long pos_phyindex; enum hwtstamp_provider_qualifier pos_phcqualifier; }; static void *ethnl_tsinfo_prepare_dump(struct sk_buff *skb, struct net_device *dev, struct tsinfo_reply_data *reply_data, struct netlink_callback *cb) { struct ethnl_tsinfo_dump_ctx *ctx = (void *)cb->ctx; void *ehdr = NULL; ehdr = ethnl_dump_put(skb, cb, ETHTOOL_MSG_TSINFO_GET_REPLY); if (!ehdr) return ERR_PTR(-EMSGSIZE); reply_data = ctx->reply_data; memset(reply_data, 0, sizeof(*reply_data)); reply_data->base.dev = dev; reply_data->ts_info.cmd = ETHTOOL_GET_TS_INFO; reply_data->ts_info.phc_index = -1; return ehdr; } static int ethnl_tsinfo_end_dump(struct sk_buff *skb, struct net_device *dev, struct tsinfo_req_info *req_info, struct tsinfo_reply_data *reply_data, void *ehdr) { int ret; reply_data->ts_info.so_timestamping |= SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE; ret = ethnl_fill_reply_header(skb, dev, ETHTOOL_A_TSINFO_HEADER); if (ret < 0) return ret; ret = tsinfo_fill_reply(skb, &req_info->base, &reply_data->base); if (ret < 0) return ret; reply_data->base.dev = NULL; genlmsg_end(skb, ehdr); return ret; } static int ethnl_tsinfo_dump_one_phydev(struct sk_buff *skb, struct net_device *dev, struct phy_device *phydev, struct netlink_callback *cb) { struct ethnl_tsinfo_dump_ctx *ctx = (void *)cb->ctx; struct tsinfo_reply_data *reply_data; struct tsinfo_req_info *req_info; void *ehdr = NULL; int ret = 0; if (!phy_has_tsinfo(phydev)) return -EOPNOTSUPP; reply_data = ctx->reply_data; req_info = ctx->req_info; ehdr = ethnl_tsinfo_prepare_dump(skb, dev, reply_data, cb); if (IS_ERR(ehdr)) return PTR_ERR(ehdr); ret = phy_ts_info(phydev, &reply_data->ts_info); if (ret < 0) goto err; if (reply_data->ts_info.phc_index >= 0) { reply_data->ts_info.phc_source = HWTSTAMP_SOURCE_PHYLIB; reply_data->ts_info.phc_phyindex = phydev->phyindex; } ret = ethnl_tsinfo_end_dump(skb, dev, req_info, reply_data, ehdr); if (ret < 0) goto err; return ret; err: genlmsg_cancel(skb, ehdr); return ret; } static int ethnl_tsinfo_dump_one_netdev(struct sk_buff *skb, struct net_device *dev, struct netlink_callback *cb) { struct ethnl_tsinfo_dump_ctx *ctx = (void *)cb->ctx; const struct ethtool_ops *ops = dev->ethtool_ops; struct tsinfo_reply_data *reply_data; struct tsinfo_req_info *req_info; void *ehdr = NULL; int ret = 0; if (!ops->get_ts_info) return -EOPNOTSUPP; reply_data = ctx->reply_data; req_info = ctx->req_info; for (; ctx->pos_phcqualifier < HWTSTAMP_PROVIDER_QUALIFIER_CNT; ctx->pos_phcqualifier++) { if (!net_support_hwtstamp_qualifier(dev, ctx->pos_phcqualifier)) continue; ehdr = ethnl_tsinfo_prepare_dump(skb, dev, reply_data, cb); if (IS_ERR(ehdr)) { ret = PTR_ERR(ehdr); goto err; } reply_data->ts_info.phc_qualifier = ctx->pos_phcqualifier; ret = ops->get_ts_info(dev, &reply_data->ts_info); if (ret < 0) goto err; if (reply_data->ts_info.phc_index >= 0) reply_data->ts_info.phc_source = HWTSTAMP_SOURCE_NETDEV; ret = ethnl_tsinfo_end_dump(skb, dev, req_info, reply_data, ehdr); if (ret < 0) goto err; } return ret; err: genlmsg_cancel(skb, ehdr); return ret; } static int ethnl_tsinfo_dump_one_net_topo(struct sk_buff *skb, struct net_device *dev, struct netlink_callback *cb) { struct ethnl_tsinfo_dump_ctx *ctx = (void *)cb->ctx; struct phy_device_node *pdn; int ret = 0; if (!ctx->netdev_dump_done) { ret = ethnl_tsinfo_dump_one_netdev(skb, dev, cb); if (ret < 0 && ret != -EOPNOTSUPP) return ret; ctx->netdev_dump_done = true; } if (!dev->link_topo) { if (phy_has_tsinfo(dev->phydev)) { ret = ethnl_tsinfo_dump_one_phydev(skb, dev, dev->phydev, cb); if (ret < 0 && ret != -EOPNOTSUPP) return ret; } return 0; } xa_for_each_start(&dev->link_topo->phys, ctx->pos_phyindex, pdn, ctx->pos_phyindex) { if (phy_has_tsinfo(pdn->phy)) { ret = ethnl_tsinfo_dump_one_phydev(skb, dev, pdn->phy, cb); if (ret < 0 && ret != -EOPNOTSUPP) return ret; } } return ret; } int ethnl_tsinfo_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { struct ethnl_tsinfo_dump_ctx *ctx = (void *)cb->ctx; struct net *net = sock_net(skb->sk); struct net_device *dev; int ret = 0; rtnl_lock(); if (ctx->req_info->base.dev) { dev = ctx->req_info->base.dev; netdev_lock_ops(dev); ret = ethnl_tsinfo_dump_one_net_topo(skb, dev, cb); netdev_unlock_ops(dev); } else { for_each_netdev_dump(net, dev, ctx->pos_ifindex) { netdev_lock_ops(dev); ret = ethnl_tsinfo_dump_one_net_topo(skb, dev, cb); netdev_unlock_ops(dev); if (ret < 0 && ret != -EOPNOTSUPP) break; ctx->pos_phyindex = 0; ctx->netdev_dump_done = false; ctx->pos_phcqualifier = HWTSTAMP_PROVIDER_QUALIFIER_PRECISE; } } rtnl_unlock(); return ret; } int ethnl_tsinfo_start(struct netlink_callback *cb) { const struct genl_dumpit_info *info = genl_dumpit_info(cb); struct ethnl_tsinfo_dump_ctx *ctx = (void *)cb->ctx; struct nlattr **tb = info->info.attrs; struct tsinfo_reply_data *reply_data; struct tsinfo_req_info *req_info; int ret; BUILD_BUG_ON(sizeof(*ctx) > sizeof(cb->ctx)); req_info = kzalloc(sizeof(*req_info), GFP_KERNEL); if (!req_info) return -ENOMEM; reply_data = kzalloc(sizeof(*reply_data), GFP_KERNEL); if (!reply_data) { ret = -ENOMEM; goto free_req_info; } ret = ethnl_parse_header_dev_get(&req_info->base, tb[ETHTOOL_A_TSINFO_HEADER], sock_net(cb->skb->sk), cb->extack, false); if (ret < 0) goto free_reply_data; ctx->req_info = req_info; ctx->reply_data = reply_data; ctx->pos_ifindex = 0; ctx->pos_phyindex = 0; ctx->netdev_dump_done = false; ctx->pos_phcqualifier = HWTSTAMP_PROVIDER_QUALIFIER_PRECISE; return 0; free_reply_data: kfree(reply_data); free_req_info: kfree(req_info); return ret; } int ethnl_tsinfo_done(struct netlink_callback *cb) { struct ethnl_tsinfo_dump_ctx *ctx = (void *)cb->ctx; struct tsinfo_req_info *req_info = ctx->req_info; ethnl_parse_header_dev_put(&req_info->base); kfree(ctx->reply_data); kfree(ctx->req_info); return 0; } const struct ethnl_request_ops ethnl_tsinfo_request_ops = { .request_cmd = ETHTOOL_MSG_TSINFO_GET, .reply_cmd = ETHTOOL_MSG_TSINFO_GET_REPLY, .hdr_attr = ETHTOOL_A_TSINFO_HEADER, .req_info_size = sizeof(struct tsinfo_req_info), .reply_data_size = sizeof(struct tsinfo_reply_data), .parse_request = tsinfo_parse_request, .prepare_data = tsinfo_prepare_data, .reply_size = tsinfo_reply_size, .fill_reply = tsinfo_fill_reply, }; |
| 130 24 151 25 80 1 132 11 12 1 1 11 73 202 202 11 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __LINUX_USB_H #define __LINUX_USB_H #include <linux/mod_devicetable.h> #include <linux/usb/ch9.h> #define USB_MAJOR 180 #define USB_DEVICE_MAJOR 189 #ifdef __KERNEL__ #include <linux/errno.h> /* for -ENODEV */ #include <linux/delay.h> /* for mdelay() */ #include <linux/interrupt.h> /* for in_interrupt() */ #include <linux/list.h> /* for struct list_head */ #include <linux/kref.h> /* for struct kref */ #include <linux/device.h> /* for struct device */ #include <linux/fs.h> /* for struct file_operations */ #include <linux/completion.h> /* for struct completion */ #include <linux/sched.h> /* for current && schedule_timeout */ #include <linux/mutex.h> /* for struct mutex */ #include <linux/pm_runtime.h> /* for runtime PM */ struct usb_device; struct usb_driver; /*-------------------------------------------------------------------------*/ /* * Host-side wrappers for standard USB descriptors ... these are parsed * from the data provided by devices. Parsing turns them from a flat * sequence of descriptors into a hierarchy: * * - devices have one (usually) or more configs; * - configs have one (often) or more interfaces; * - interfaces have one (usually) or more settings; * - each interface setting has zero or (usually) more endpoints. * - a SuperSpeed endpoint has a companion descriptor * * And there might be other descriptors mixed in with those. * * Devices may also have class-specific or vendor-specific descriptors. */ struct ep_device; /** * struct usb_host_endpoint - host-side endpoint descriptor and queue * @desc: descriptor for this endpoint, wMaxPacketSize in native byteorder * @ss_ep_comp: SuperSpeed companion descriptor for this endpoint * @ssp_isoc_ep_comp: SuperSpeedPlus isoc companion descriptor for this endpoint * @eusb2_isoc_ep_comp: eUSB2 isoc companion descriptor for this endpoint * @urb_list: urbs queued to this endpoint; maintained by usbcore * @hcpriv: for use by HCD; typically holds hardware dma queue head (QH) * with one or more transfer descriptors (TDs) per urb * @ep_dev: ep_device for sysfs info * @extra: descriptors following this endpoint in the configuration * @extralen: how many bytes of "extra" are valid * @enabled: URBs may be submitted to this endpoint * @streams: number of USB-3 streams allocated on the endpoint * * USB requests are always queued to a given endpoint, identified by a * descriptor within an active interface in a given USB configuration. */ struct usb_host_endpoint { struct usb_endpoint_descriptor desc; struct usb_ss_ep_comp_descriptor ss_ep_comp; struct usb_ssp_isoc_ep_comp_descriptor ssp_isoc_ep_comp; struct usb_eusb2_isoc_ep_comp_descriptor eusb2_isoc_ep_comp; struct list_head urb_list; void *hcpriv; struct ep_device *ep_dev; /* For sysfs info */ unsigned char *extra; /* Extra descriptors */ int extralen; int enabled; int streams; }; /* host-side wrapper for one interface setting's parsed descriptors */ struct usb_host_interface { struct usb_interface_descriptor desc; int extralen; unsigned char *extra; /* Extra descriptors */ /* array of desc.bNumEndpoints endpoints associated with this * interface setting. these will be in no particular order. */ struct usb_host_endpoint *endpoint; char *string; /* iInterface string, if present */ }; enum usb_interface_condition { USB_INTERFACE_UNBOUND = 0, USB_INTERFACE_BINDING, USB_INTERFACE_BOUND, USB_INTERFACE_UNBINDING, }; int __must_check usb_find_common_endpoints(struct usb_host_interface *alt, struct usb_endpoint_descriptor **bulk_in, struct usb_endpoint_descriptor **bulk_out, struct usb_endpoint_descriptor **int_in, struct usb_endpoint_descriptor **int_out); int __must_check usb_find_common_endpoints_reverse(struct usb_host_interface *alt, struct usb_endpoint_descriptor **bulk_in, struct usb_endpoint_descriptor **bulk_out, struct usb_endpoint_descriptor **int_in, struct usb_endpoint_descriptor **int_out); static inline int __must_check usb_find_bulk_in_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **bulk_in) { return usb_find_common_endpoints(alt, bulk_in, NULL, NULL, NULL); } static inline int __must_check usb_find_bulk_out_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **bulk_out) { return usb_find_common_endpoints(alt, NULL, bulk_out, NULL, NULL); } static inline int __must_check usb_find_int_in_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **int_in) { return usb_find_common_endpoints(alt, NULL, NULL, int_in, NULL); } static inline int __must_check usb_find_int_out_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **int_out) { return usb_find_common_endpoints(alt, NULL, NULL, NULL, int_out); } static inline int __must_check usb_find_last_bulk_in_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **bulk_in) { return usb_find_common_endpoints_reverse(alt, bulk_in, NULL, NULL, NULL); } static inline int __must_check usb_find_last_bulk_out_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **bulk_out) { return usb_find_common_endpoints_reverse(alt, NULL, bulk_out, NULL, NULL); } static inline int __must_check usb_find_last_int_in_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **int_in) { return usb_find_common_endpoints_reverse(alt, NULL, NULL, int_in, NULL); } static inline int __must_check usb_find_last_int_out_endpoint(struct usb_host_interface *alt, struct usb_endpoint_descriptor **int_out) { return usb_find_common_endpoints_reverse(alt, NULL, NULL, NULL, int_out); } enum usb_wireless_status { USB_WIRELESS_STATUS_NA = 0, USB_WIRELESS_STATUS_DISCONNECTED, USB_WIRELESS_STATUS_CONNECTED, }; /** * struct usb_interface - what usb device drivers talk to * @altsetting: array of interface structures, one for each alternate * setting that may be selected. Each one includes a set of * endpoint configurations. They will be in no particular order. * @cur_altsetting: the current altsetting. * @num_altsetting: number of altsettings defined. * @intf_assoc: interface association descriptor * @minor: the minor number assigned to this interface, if this * interface is bound to a driver that uses the USB major number. * If this interface does not use the USB major, this field should * be unused. The driver should set this value in the probe() * function of the driver, after it has been assigned a minor * number from the USB core by calling usb_register_dev(). * @condition: binding state of the interface: not bound, binding * (in probe()), bound to a driver, or unbinding (in disconnect()) * @sysfs_files_created: sysfs attributes exist * @ep_devs_created: endpoint child pseudo-devices exist * @unregistering: flag set when the interface is being unregistered * @needs_remote_wakeup: flag set when the driver requires remote-wakeup * capability during autosuspend. * @needs_altsetting0: flag set when a set-interface request for altsetting 0 * has been deferred. * @needs_binding: flag set when the driver should be re-probed or unbound * following a reset or suspend operation it doesn't support. * @authorized: This allows to (de)authorize individual interfaces instead * a whole device in contrast to the device authorization. * @wireless_status: if the USB device uses a receiver/emitter combo, whether * the emitter is connected. * @wireless_status_work: Used for scheduling wireless status changes * from atomic context. * @dev: driver model's view of this device * @usb_dev: if an interface is bound to the USB major, this will point * to the sysfs representation for that device. * @reset_ws: Used for scheduling resets from atomic context. * @resetting_device: USB core reset the device, so use alt setting 0 as * current; needs bandwidth alloc after reset. * * USB device drivers attach to interfaces on a physical device. Each * interface encapsulates a single high level function, such as feeding * an audio stream to a speaker or reporting a change in a volume control. * Many USB devices only have one interface. The protocol used to talk to * an interface's endpoints can be defined in a usb "class" specification, * or by a product's vendor. The (default) control endpoint is part of * every interface, but is never listed among the interface's descriptors. * * The driver that is bound to the interface can use standard driver model * calls such as dev_get_drvdata() on the dev member of this structure. * * Each interface may have alternate settings. The initial configuration * of a device sets altsetting 0, but the device driver can change * that setting using usb_set_interface(). Alternate settings are often * used to control the use of periodic endpoints, such as by having * different endpoints use different amounts of reserved USB bandwidth. * All standards-conformant USB devices that use isochronous endpoints * will use them in non-default settings. * * The USB specification says that alternate setting numbers must run from * 0 to one less than the total number of alternate settings. But some * devices manage to mess this up, and the structures aren't necessarily * stored in numerical order anyhow. Use usb_altnum_to_altsetting() to * look up an alternate setting in the altsetting array based on its number. */ struct usb_interface { /* array of alternate settings for this interface, * stored in no particular order */ struct usb_host_interface *altsetting; struct usb_host_interface *cur_altsetting; /* the currently * active alternate setting */ unsigned num_altsetting; /* number of alternate settings */ /* If there is an interface association descriptor then it will list * the associated interfaces */ struct usb_interface_assoc_descriptor *intf_assoc; int minor; /* minor number this interface is * bound to */ enum usb_interface_condition condition; /* state of binding */ unsigned sysfs_files_created:1; /* the sysfs attributes exist */ unsigned ep_devs_created:1; /* endpoint "devices" exist */ unsigned unregistering:1; /* unregistration is in progress */ unsigned needs_remote_wakeup:1; /* driver requires remote wakeup */ unsigned needs_altsetting0:1; /* switch to altsetting 0 is pending */ unsigned needs_binding:1; /* needs delayed unbind/rebind */ unsigned resetting_device:1; /* true: bandwidth alloc after reset */ unsigned authorized:1; /* used for interface authorization */ enum usb_wireless_status wireless_status; struct work_struct wireless_status_work; struct device dev; /* interface specific device info */ struct device *usb_dev; struct work_struct reset_ws; /* for resets in atomic context */ }; #define to_usb_interface(__dev) container_of_const(__dev, struct usb_interface, dev) static inline void *usb_get_intfdata(struct usb_interface *intf) { return dev_get_drvdata(&intf->dev); } /** * usb_set_intfdata() - associate driver-specific data with an interface * @intf: USB interface * @data: driver data * * Drivers can use this function in their probe() callbacks to associate * driver-specific data with an interface. * * Note that there is generally no need to clear the driver-data pointer even * if some drivers do so for historical or implementation-specific reasons. */ static inline void usb_set_intfdata(struct usb_interface *intf, void *data) { dev_set_drvdata(&intf->dev, data); } struct usb_interface *usb_get_intf(struct usb_interface *intf); void usb_put_intf(struct usb_interface *intf); /* Hard limit */ #define USB_MAXENDPOINTS 30 /* this maximum is arbitrary */ #define USB_MAXINTERFACES 32 #define USB_MAXIADS (USB_MAXINTERFACES/2) bool usb_check_bulk_endpoints( const struct usb_interface *intf, const u8 *ep_addrs); bool usb_check_int_endpoints( const struct usb_interface *intf, const u8 *ep_addrs); /* * USB Resume Timer: Every Host controller driver should drive the resume * signalling on the bus for the amount of time defined by this macro. * * That way we will have a 'stable' behavior among all HCDs supported by Linux. * * Note that the USB Specification states we should drive resume for *at least* * 20 ms, but it doesn't give an upper bound. This creates two possible * situations which we want to avoid: * * (a) sometimes an msleep(20) might expire slightly before 20 ms, which causes * us to fail USB Electrical Tests, thus failing Certification * * (b) Some (many) devices actually need more than 20 ms of resume signalling, * and while we can argue that's against the USB Specification, we don't have * control over which devices a certification laboratory will be using for * certification. If CertLab uses a device which was tested against Windows and * that happens to have relaxed resume signalling rules, we might fall into * situations where we fail interoperability and electrical tests. * * In order to avoid both conditions, we're using a 40 ms resume timeout, which * should cope with both LPJ calibration errors and devices not following every * detail of the USB Specification. */ #define USB_RESUME_TIMEOUT 40 /* ms */ /** * struct usb_interface_cache - long-term representation of a device interface * @num_altsetting: number of altsettings defined. * @ref: reference counter. * @altsetting: variable-length array of interface structures, one for * each alternate setting that may be selected. Each one includes a * set of endpoint configurations. They will be in no particular order. * * These structures persist for the lifetime of a usb_device, unlike * struct usb_interface (which persists only as long as its configuration * is installed). The altsetting arrays can be accessed through these * structures at any time, permitting comparison of configurations and * providing support for the /sys/kernel/debug/usb/devices pseudo-file. */ struct usb_interface_cache { unsigned num_altsetting; /* number of alternate settings */ struct kref ref; /* reference counter */ /* variable-length array of alternate settings for this interface, * stored in no particular order */ struct usb_host_interface altsetting[]; }; #define ref_to_usb_interface_cache(r) \ container_of(r, struct usb_interface_cache, ref) #define altsetting_to_usb_interface_cache(a) \ container_of(a, struct usb_interface_cache, altsetting[0]) /** * struct usb_host_config - representation of a device's configuration * @desc: the device's configuration descriptor. * @string: pointer to the cached version of the iConfiguration string, if * present for this configuration. * @intf_assoc: list of any interface association descriptors in this config * @interface: array of pointers to usb_interface structures, one for each * interface in the configuration. The number of interfaces is stored * in desc.bNumInterfaces. These pointers are valid only while the * configuration is active. * @intf_cache: array of pointers to usb_interface_cache structures, one * for each interface in the configuration. These structures exist * for the entire life of the device. * @extra: pointer to buffer containing all extra descriptors associated * with this configuration (those preceding the first interface * descriptor). * @extralen: length of the extra descriptors buffer. * * USB devices may have multiple configurations, but only one can be active * at any time. Each encapsulates a different operational environment; * for example, a dual-speed device would have separate configurations for * full-speed and high-speed operation. The number of configurations * available is stored in the device descriptor as bNumConfigurations. * * A configuration can contain multiple interfaces. Each corresponds to * a different function of the USB device, and all are available whenever * the configuration is active. The USB standard says that interfaces * are supposed to be numbered from 0 to desc.bNumInterfaces-1, but a lot * of devices get this wrong. In addition, the interface array is not * guaranteed to be sorted in numerical order. Use usb_ifnum_to_if() to * look up an interface entry based on its number. * * Device drivers should not attempt to activate configurations. The choice * of which configuration to install is a policy decision based on such * considerations as available power, functionality provided, and the user's * desires (expressed through userspace tools). However, drivers can call * usb_reset_configuration() to reinitialize the current configuration and * all its interfaces. */ struct usb_host_config { struct usb_config_descriptor desc; char *string; /* iConfiguration string, if present */ /* List of any Interface Association Descriptors in this * configuration. */ struct usb_interface_assoc_descriptor *intf_assoc[USB_MAXIADS]; /* the interfaces associated with this configuration, * stored in no particular order */ struct usb_interface *interface[USB_MAXINTERFACES]; /* Interface information available even when this is not the * active configuration */ struct usb_interface_cache *intf_cache[USB_MAXINTERFACES]; unsigned char *extra; /* Extra descriptors */ int extralen; }; /* USB2.0 and USB3.0 device BOS descriptor set */ struct usb_host_bos { struct usb_bos_descriptor *desc; struct usb_ext_cap_descriptor *ext_cap; struct usb_ss_cap_descriptor *ss_cap; struct usb_ssp_cap_descriptor *ssp_cap; struct usb_ss_container_id_descriptor *ss_id; struct usb_ptm_cap_descriptor *ptm_cap; }; int __usb_get_extra_descriptor(char *buffer, unsigned size, unsigned char type, void **ptr, size_t min); #define usb_get_extra_descriptor(ifpoint, type, ptr) \ __usb_get_extra_descriptor((ifpoint)->extra, \ (ifpoint)->extralen, \ type, (void **)ptr, sizeof(**(ptr))) /* ----------------------------------------------------------------------- */ /* * Allocated per bus (tree of devices) we have: */ struct usb_bus { struct device *controller; /* host side hardware */ struct device *sysdev; /* as seen from firmware or bus */ int busnum; /* Bus number (in order of reg) */ const char *bus_name; /* stable id (PCI slot_name etc) */ u8 uses_pio_for_control; /* * Does the host controller use PIO * for control transfers? */ u8 otg_port; /* 0, or number of OTG/HNP port */ unsigned is_b_host:1; /* true during some HNP roleswitches */ unsigned b_hnp_enable:1; /* OTG: did A-Host enable HNP? */ unsigned no_stop_on_short:1; /* * Quirk: some controllers don't stop * the ep queue on a short transfer * with the URB_SHORT_NOT_OK flag set. */ unsigned no_sg_constraint:1; /* no sg constraint */ unsigned sg_tablesize; /* 0 or largest number of sg list entries */ int devnum_next; /* Next open device number in * round-robin allocation */ struct mutex devnum_next_mutex; /* devnum_next mutex */ DECLARE_BITMAP(devmap, 128); /* USB device number allocation bitmap */ struct usb_device *root_hub; /* Root hub */ struct usb_bus *hs_companion; /* Companion EHCI bus, if any */ int bandwidth_allocated; /* on this bus: how much of the time * reserved for periodic (intr/iso) * requests is used, on average? * Units: microseconds/frame. * Limits: Full/low speed reserve 90%, * while high speed reserves 80%. */ int bandwidth_int_reqs; /* number of Interrupt requests */ int bandwidth_isoc_reqs; /* number of Isoc. requests */ unsigned resuming_ports; /* bit array: resuming root-hub ports */ #if defined(CONFIG_USB_MON) || defined(CONFIG_USB_MON_MODULE) struct mon_bus *mon_bus; /* non-null when associated */ int monitored; /* non-zero when monitored */ #endif }; struct usb_dev_state; /* ----------------------------------------------------------------------- */ struct usb_tt; enum usb_link_tunnel_mode { USB_LINK_UNKNOWN = 0, USB_LINK_NATIVE, USB_LINK_TUNNELED, }; enum usb_port_connect_type { USB_PORT_CONNECT_TYPE_UNKNOWN = 0, USB_PORT_CONNECT_TYPE_HOT_PLUG, USB_PORT_CONNECT_TYPE_HARD_WIRED, USB_PORT_NOT_USED, }; /* * USB port quirks. */ /* For the given port, prefer the old (faster) enumeration scheme. */ #define USB_PORT_QUIRK_OLD_SCHEME BIT(0) /* Decrease TRSTRCY to 10ms during device enumeration. */ #define USB_PORT_QUIRK_FAST_ENUM BIT(1) /* * USB 2.0 Link Power Management (LPM) parameters. */ struct usb2_lpm_parameters { /* Best effort service latency indicate how long the host will drive * resume on an exit from L1. */ unsigned int besl; /* Timeout value in microseconds for the L1 inactivity (LPM) timer. * When the timer counts to zero, the parent hub will initiate a LPM * transition to L1. */ int timeout; }; /* * USB 3.0 Link Power Management (LPM) parameters. * * PEL and SEL are USB 3.0 Link PM latencies for device-initiated LPM exit. * MEL is the USB 3.0 Link PM latency for host-initiated LPM exit. * All three are stored in nanoseconds. */ struct usb3_lpm_parameters { /* * Maximum exit latency (MEL) for the host to send a packet to the * device (either a Ping for isoc endpoints, or a data packet for * interrupt endpoints), the hubs to decode the packet, and for all hubs * in the path to transition the links to U0. */ unsigned int mel; /* * Maximum exit latency for a device-initiated LPM transition to bring * all links into U0. Abbreviated as "PEL" in section 9.4.12 of the USB * 3.0 spec, with no explanation of what "P" stands for. "Path"? */ unsigned int pel; /* * The System Exit Latency (SEL) includes PEL, and three other * latencies. After a device initiates a U0 transition, it will take * some time from when the device sends the ERDY to when it will finally * receive the data packet. Basically, SEL should be the worse-case * latency from when a device starts initiating a U0 transition to when * it will get data. */ unsigned int sel; /* * The idle timeout value that is currently programmed into the parent * hub for this device. When the timer counts to zero, the parent hub * will initiate an LPM transition to either U1 or U2. */ int timeout; }; /** * struct usb_device - kernel's representation of a USB device * @devnum: device number; address on a USB bus * @devpath: device ID string for use in messages (e.g., /port/...) * @route: tree topology hex string for use with xHCI * @state: device state: configured, not attached, etc. * @speed: device speed: high/full/low (or error) * @rx_lanes: number of rx lanes in use, USB 3.2 adds dual-lane support * @tx_lanes: number of tx lanes in use, USB 3.2 adds dual-lane support * @ssp_rate: SuperSpeed Plus phy signaling rate and lane count * @tt: Transaction Translator info; used with low/full speed dev, highspeed hub * @ttport: device port on that tt hub * @toggle: one bit for each endpoint, with ([0] = IN, [1] = OUT) endpoints * @parent: our hub, unless we're the root * @bus: bus we're part of * @ep0: endpoint 0 data (default control pipe) * @dev: generic device interface * @descriptor: USB device descriptor * @bos: USB device BOS descriptor set * @config: all of the device's configs * @actconfig: the active configuration * @ep_in: array of IN endpoints * @ep_out: array of OUT endpoints * @rawdescriptors: raw descriptors for each config * @bus_mA: Current available from the bus * @portnum: parent port number (origin 1) * @level: number of USB hub ancestors * @devaddr: device address, XHCI: assigned by HW, others: same as devnum * @can_submit: URBs may be submitted * @persist_enabled: USB_PERSIST enabled for this device * @reset_in_progress: the device is being reset * @have_langid: whether string_langid is valid * @authorized: policy has said we can use it; * (user space) policy determines if we authorize this device to be * used or not. By default, wired USB devices are authorized. * WUSB devices are not, until we authorize them from user space. * FIXME -- complete doc * @authenticated: Crypto authentication passed * @tunnel_mode: Connection native or tunneled over USB4 * @usb4_link: device link to the USB4 host interface * @lpm_capable: device supports LPM * @lpm_devinit_allow: Allow USB3 device initiated LPM, exit latency is in range * @usb2_hw_lpm_capable: device can perform USB2 hardware LPM * @usb2_hw_lpm_besl_capable: device can perform USB2 hardware BESL LPM * @usb2_hw_lpm_enabled: USB2 hardware LPM is enabled * @usb2_hw_lpm_allowed: Userspace allows USB 2.0 LPM to be enabled * @usb3_lpm_u1_enabled: USB3 hardware U1 LPM enabled * @usb3_lpm_u2_enabled: USB3 hardware U2 LPM enabled * @string_langid: language ID for strings * @product: iProduct string, if present (static) * @manufacturer: iManufacturer string, if present (static) * @serial: iSerialNumber string, if present (static) * @filelist: usbfs files that are open to this device * @maxchild: number of ports if hub * @quirks: quirks of the whole device * @urbnum: number of URBs submitted for the whole device * @active_duration: total time device is not suspended * @connect_time: time device was first connected * @do_remote_wakeup: remote wakeup should be enabled * @reset_resume: needs reset instead of resume * @port_is_suspended: the upstream port is suspended (L2 or U3) * @offload_at_suspend: offload activities during suspend is enabled. * @offload_usage: number of offload activities happening on this usb device. * @slot_id: Slot ID assigned by xHCI * @l1_params: best effor service latency for USB2 L1 LPM state, and L1 timeout. * @u1_params: exit latencies for USB3 U1 LPM state, and hub-initiated timeout. * @u2_params: exit latencies for USB3 U2 LPM state, and hub-initiated timeout. * @lpm_disable_count: Ref count used by usb_disable_lpm() and usb_enable_lpm() * to keep track of the number of functions that require USB 3.0 Link Power * Management to be disabled for this usb_device. This count should only * be manipulated by those functions, with the bandwidth_mutex is held. * @hub_delay: cached value consisting of: * parent->hub_delay + wHubDelay + tTPTransmissionDelay (40ns) * Will be used as wValue for SetIsochDelay requests. * @use_generic_driver: ask driver core to reprobe using the generic driver. * * Notes: * Usbcore drivers should not set usbdev->state directly. Instead use * usb_set_device_state(). */ struct usb_device { int devnum; char devpath[16]; u32 route; enum usb_device_state state; enum usb_device_speed speed; unsigned int rx_lanes; unsigned int tx_lanes; enum usb_ssp_rate ssp_rate; struct usb_tt *tt; int ttport; unsigned int toggle[2]; struct usb_device *parent; struct usb_bus *bus; struct usb_host_endpoint ep0; struct device dev; struct usb_device_descriptor descriptor; struct usb_host_bos *bos; struct usb_host_config *config; struct usb_host_config *actconfig; struct usb_host_endpoint *ep_in[16]; struct usb_host_endpoint *ep_out[16]; char **rawdescriptors; unsigned short bus_mA; u8 portnum; u8 level; u8 devaddr; unsigned can_submit:1; unsigned persist_enabled:1; unsigned reset_in_progress:1; unsigned have_langid:1; unsigned authorized:1; unsigned authenticated:1; unsigned lpm_capable:1; unsigned lpm_devinit_allow:1; unsigned usb2_hw_lpm_capable:1; unsigned usb2_hw_lpm_besl_capable:1; unsigned usb2_hw_lpm_enabled:1; unsigned usb2_hw_lpm_allowed:1; unsigned usb3_lpm_u1_enabled:1; unsigned usb3_lpm_u2_enabled:1; int string_langid; /* static strings from the device */ char *product; char *manufacturer; char *serial; struct list_head filelist; int maxchild; u32 quirks; atomic_t urbnum; unsigned long active_duration; unsigned long connect_time; unsigned do_remote_wakeup:1; unsigned reset_resume:1; unsigned port_is_suspended:1; unsigned offload_at_suspend:1; int offload_usage; enum usb_link_tunnel_mode tunnel_mode; struct device_link *usb4_link; int slot_id; struct usb2_lpm_parameters l1_params; struct usb3_lpm_parameters u1_params; struct usb3_lpm_parameters u2_params; unsigned lpm_disable_count; u16 hub_delay; unsigned use_generic_driver:1; }; #define to_usb_device(__dev) container_of_const(__dev, struct usb_device, dev) static inline struct usb_device *__intf_to_usbdev(struct usb_interface *intf) { return to_usb_device(intf->dev.parent); } static inline const struct usb_device *__intf_to_usbdev_const(const struct usb_interface *intf) { return to_usb_device((const struct device *)intf->dev.parent); } #define interface_to_usbdev(intf) \ _Generic((intf), \ const struct usb_interface *: __intf_to_usbdev_const, \ struct usb_interface *: __intf_to_usbdev)(intf) extern struct usb_device *usb_get_dev(struct usb_device *dev); extern void usb_put_dev(struct usb_device *dev); extern struct usb_device *usb_hub_find_child(struct usb_device *hdev, int port1); /** * usb_hub_for_each_child - iterate over all child devices on the hub * @hdev: USB device belonging to the usb hub * @port1: portnum associated with child device * @child: child device pointer */ #define usb_hub_for_each_child(hdev, port1, child) \ for (port1 = 1, child = usb_hub_find_child(hdev, port1); \ port1 <= hdev->maxchild; \ child = usb_hub_find_child(hdev, ++port1)) \ if (!child) continue; else /* USB device locking */ #define usb_lock_device(udev) device_lock(&(udev)->dev) #define usb_unlock_device(udev) device_unlock(&(udev)->dev) #define usb_lock_device_interruptible(udev) device_lock_interruptible(&(udev)->dev) #define usb_trylock_device(udev) device_trylock(&(udev)->dev) extern int usb_lock_device_for_reset(struct usb_device *udev, const struct usb_interface *iface); /* USB port reset for device reinitialization */ extern int usb_reset_device(struct usb_device *dev); extern void usb_queue_reset_device(struct usb_interface *dev); extern struct device *usb_intf_get_dma_device(struct usb_interface *intf); #ifdef CONFIG_ACPI extern int usb_acpi_set_power_state(struct usb_device *hdev, int index, bool enable); extern bool usb_acpi_power_manageable(struct usb_device *hdev, int index); extern int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index); #else static inline int usb_acpi_set_power_state(struct usb_device *hdev, int index, bool enable) { return 0; } static inline bool usb_acpi_power_manageable(struct usb_device *hdev, int index) { return true; } static inline int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index) { return 0; } #endif /* USB autosuspend and autoresume */ #ifdef CONFIG_PM extern void usb_enable_autosuspend(struct usb_device *udev); extern void usb_disable_autosuspend(struct usb_device *udev); extern int usb_autopm_get_interface(struct usb_interface *intf); extern void usb_autopm_put_interface(struct usb_interface *intf); extern int usb_autopm_get_interface_async(struct usb_interface *intf); extern void usb_autopm_put_interface_async(struct usb_interface *intf); extern void usb_autopm_get_interface_no_resume(struct usb_interface *intf); extern void usb_autopm_put_interface_no_suspend(struct usb_interface *intf); static inline void usb_mark_last_busy(struct usb_device *udev) { pm_runtime_mark_last_busy(&udev->dev); } #else static inline void usb_enable_autosuspend(struct usb_device *udev) { } static inline void usb_disable_autosuspend(struct usb_device *udev) { } static inline int usb_autopm_get_interface(struct usb_interface *intf) { return 0; } static inline int usb_autopm_get_interface_async(struct usb_interface *intf) { return 0; } static inline void usb_autopm_put_interface(struct usb_interface *intf) { } static inline void usb_autopm_put_interface_async(struct usb_interface *intf) { } static inline void usb_autopm_get_interface_no_resume( struct usb_interface *intf) { } static inline void usb_autopm_put_interface_no_suspend( struct usb_interface *intf) { } static inline void usb_mark_last_busy(struct usb_device *udev) { } #endif #if IS_ENABLED(CONFIG_USB_XHCI_SIDEBAND) int usb_offload_get(struct usb_device *udev); int usb_offload_put(struct usb_device *udev); bool usb_offload_check(struct usb_device *udev); #else static inline int usb_offload_get(struct usb_device *udev) { return 0; } static inline int usb_offload_put(struct usb_device *udev) { return 0; } static inline bool usb_offload_check(struct usb_device *udev) { return false; } #endif extern int usb_disable_lpm(struct usb_device *udev); extern void usb_enable_lpm(struct usb_device *udev); /* Same as above, but these functions lock/unlock the bandwidth_mutex. */ extern int usb_unlocked_disable_lpm(struct usb_device *udev); extern void usb_unlocked_enable_lpm(struct usb_device *udev); extern int usb_disable_ltm(struct usb_device *udev); extern void usb_enable_ltm(struct usb_device *udev); static inline bool usb_device_supports_ltm(struct usb_device *udev) { if (udev->speed < USB_SPEED_SUPER || !udev->bos || !udev->bos->ss_cap) return false; return udev->bos->ss_cap->bmAttributes & USB_LTM_SUPPORT; } static inline bool usb_device_no_sg_constraint(struct usb_device *udev) { return udev && udev->bus && udev->bus->no_sg_constraint; } /*-------------------------------------------------------------------------*/ /* for drivers using iso endpoints */ extern int usb_get_current_frame_number(struct usb_device *usb_dev); /* Sets up a group of bulk endpoints to support multiple stream IDs. */ extern int usb_alloc_streams(struct usb_interface *interface, struct usb_host_endpoint **eps, unsigned int num_eps, unsigned int num_streams, gfp_t mem_flags); /* Reverts a group of bulk endpoints back to not using stream IDs. */ extern int usb_free_streams(struct usb_interface *interface, struct usb_host_endpoint **eps, unsigned int num_eps, gfp_t mem_flags); /* used these for multi-interface device registration */ extern int usb_driver_claim_interface(struct usb_driver *driver, struct usb_interface *iface, void *data); /** * usb_interface_claimed - returns true iff an interface is claimed * @iface: the interface being checked * * Return: %true (nonzero) iff the interface is claimed, else %false * (zero). * * Note: * Callers must own the driver model's usb bus readlock. So driver * probe() entries don't need extra locking, but other call contexts * may need to explicitly claim that lock. * */ static inline int usb_interface_claimed(struct usb_interface *iface) { return (iface->dev.driver != NULL); } extern void usb_driver_release_interface(struct usb_driver *driver, struct usb_interface *iface); int usb_set_wireless_status(struct usb_interface *iface, enum usb_wireless_status status); const struct usb_device_id *usb_match_id(struct usb_interface *interface, const struct usb_device_id *id); extern int usb_match_one_id(struct usb_interface *interface, const struct usb_device_id *id); extern int usb_for_each_dev(void *data, int (*fn)(struct usb_device *, void *)); extern struct usb_interface *usb_find_interface(struct usb_driver *drv, int minor); extern struct usb_interface *usb_ifnum_to_if(const struct usb_device *dev, unsigned ifnum); extern struct usb_host_interface *usb_altnum_to_altsetting( const struct usb_interface *intf, unsigned int altnum); extern struct usb_host_interface *usb_find_alt_setting( struct usb_host_config *config, unsigned int iface_num, unsigned int alt_num); /* port claiming functions */ int usb_hub_claim_port(struct usb_device *hdev, unsigned port1, struct usb_dev_state *owner); int usb_hub_release_port(struct usb_device *hdev, unsigned port1, struct usb_dev_state *owner); /** * usb_make_path - returns stable device path in the usb tree * @dev: the device whose path is being constructed * @buf: where to put the string * @size: how big is "buf"? * * Return: Length of the string (> 0) or negative if size was too small. * * Note: * This identifier is intended to be "stable", reflecting physical paths in * hardware such as physical bus addresses for host controllers or ports on * USB hubs. That makes it stay the same until systems are physically * reconfigured, by re-cabling a tree of USB devices or by moving USB host * controllers. Adding and removing devices, including virtual root hubs * in host controller driver modules, does not change these path identifiers; * neither does rebooting or re-enumerating. These are more useful identifiers * than changeable ("unstable") ones like bus numbers or device addresses. * * With a partial exception for devices connected to USB 2.0 root hubs, these * identifiers are also predictable. So long as the device tree isn't changed, * plugging any USB device into a given hub port always gives it the same path. * Because of the use of "companion" controllers, devices connected to ports on * USB 2.0 root hubs (EHCI host controllers) will get one path ID if they are * high speed, and a different one if they are full or low speed. */ static inline int usb_make_path(struct usb_device *dev, char *buf, size_t size) { int actual; actual = snprintf(buf, size, "usb-%s-%s", dev->bus->bus_name, dev->devpath); return (actual >= (int)size) ? -1 : actual; } /*-------------------------------------------------------------------------*/ #define USB_DEVICE_ID_MATCH_DEVICE \ (USB_DEVICE_ID_MATCH_VENDOR | USB_DEVICE_ID_MATCH_PRODUCT) #define USB_DEVICE_ID_MATCH_DEV_RANGE \ (USB_DEVICE_ID_MATCH_DEV_LO | USB_DEVICE_ID_MATCH_DEV_HI) #define USB_DEVICE_ID_MATCH_DEVICE_AND_VERSION \ (USB_DEVICE_ID_MATCH_DEVICE | USB_DEVICE_ID_MATCH_DEV_RANGE) #define USB_DEVICE_ID_MATCH_DEV_INFO \ (USB_DEVICE_ID_MATCH_DEV_CLASS | \ USB_DEVICE_ID_MATCH_DEV_SUBCLASS | \ USB_DEVICE_ID_MATCH_DEV_PROTOCOL) #define USB_DEVICE_ID_MATCH_INT_INFO \ (USB_DEVICE_ID_MATCH_INT_CLASS | \ USB_DEVICE_ID_MATCH_INT_SUBCLASS | \ USB_DEVICE_ID_MATCH_INT_PROTOCOL) /** * USB_DEVICE - macro used to describe a specific usb device * @vend: the 16 bit USB Vendor ID * @prod: the 16 bit USB Product ID * * This macro is used to create a struct usb_device_id that matches a * specific device. */ #define USB_DEVICE(vend, prod) \ .match_flags = USB_DEVICE_ID_MATCH_DEVICE, \ .idVendor = (vend), \ .idProduct = (prod) /** * USB_DEVICE_VER - describe a specific usb device with a version range * @vend: the 16 bit USB Vendor ID * @prod: the 16 bit USB Product ID * @lo: the bcdDevice_lo value * @hi: the bcdDevice_hi value * * This macro is used to create a struct usb_device_id that matches a * specific device, with a version range. */ #define USB_DEVICE_VER(vend, prod, lo, hi) \ .match_flags = USB_DEVICE_ID_MATCH_DEVICE_AND_VERSION, \ .idVendor = (vend), \ .idProduct = (prod), \ .bcdDevice_lo = (lo), \ .bcdDevice_hi = (hi) /** * USB_DEVICE_INTERFACE_CLASS - describe a usb device with a specific interface class * @vend: the 16 bit USB Vendor ID * @prod: the 16 bit USB Product ID * @cl: bInterfaceClass value * * This macro is used to create a struct usb_device_id that matches a * specific interface class of devices. */ #define USB_DEVICE_INTERFACE_CLASS(vend, prod, cl) \ .match_flags = USB_DEVICE_ID_MATCH_DEVICE | \ USB_DEVICE_ID_MATCH_INT_CLASS, \ .idVendor = (vend), \ .idProduct = (prod), \ .bInterfaceClass = (cl) /** * USB_DEVICE_INTERFACE_PROTOCOL - describe a usb device with a specific interface protocol * @vend: the 16 bit USB Vendor ID * @prod: the 16 bit USB Product ID * @pr: bInterfaceProtocol value * * This macro is used to create a struct usb_device_id that matches a * specific interface protocol of devices. */ #define USB_DEVICE_INTERFACE_PROTOCOL(vend, prod, pr) \ .match_flags = USB_DEVICE_ID_MATCH_DEVICE | \ USB_DEVICE_ID_MATCH_INT_PROTOCOL, \ .idVendor = (vend), \ .idProduct = (prod), \ .bInterfaceProtocol = (pr) /** * USB_DEVICE_INTERFACE_NUMBER - describe a usb device with a specific interface number * @vend: the 16 bit USB Vendor ID * @prod: the 16 bit USB Product ID * @num: bInterfaceNumber value * * This macro is used to create a struct usb_device_id that matches a * specific interface number of devices. */ #define USB_DEVICE_INTERFACE_NUMBER(vend, prod, num) \ .match_flags = USB_DEVICE_ID_MATCH_DEVICE | \ USB_DEVICE_ID_MATCH_INT_NUMBER, \ .idVendor = (vend), \ .idProduct = (prod), \ .bInterfaceNumber = (num) /** * USB_DEVICE_INFO - macro used to describe a class of usb devices * @cl: bDeviceClass value * @sc: bDeviceSubClass value * @pr: bDeviceProtocol value * * This macro is used to create a struct usb_device_id that matches a * specific class of devices. */ #define USB_DEVICE_INFO(cl, sc, pr) \ .match_flags = USB_DEVICE_ID_MATCH_DEV_INFO, \ .bDeviceClass = (cl), \ .bDeviceSubClass = (sc), \ .bDeviceProtocol = (pr) /** * USB_INTERFACE_INFO - macro used to describe a class of usb interfaces * @cl: bInterfaceClass value * @sc: bInterfaceSubClass value * @pr: bInterfaceProtocol value * * This macro is used to create a struct usb_device_id that matches a * specific class of interfaces. */ #define USB_INTERFACE_INFO(cl, sc, pr) \ .match_flags = USB_DEVICE_ID_MATCH_INT_INFO, \ .bInterfaceClass = (cl), \ .bInterfaceSubClass = (sc), \ .bInterfaceProtocol = (pr) /** * USB_DEVICE_AND_INTERFACE_INFO - describe a specific usb device with a class of usb interfaces * @vend: the 16 bit USB Vendor ID * @prod: the 16 bit USB Product ID * @cl: bInterfaceClass value * @sc: bInterfaceSubClass value * @pr: bInterfaceProtocol value * * This macro is used to create a struct usb_device_id that matches a * specific device with a specific class of interfaces. * * This is especially useful when explicitly matching devices that have * vendor specific bDeviceClass values, but standards-compliant interfaces. */ #define USB_DEVICE_AND_INTERFACE_INFO(vend, prod, cl, sc, pr) \ .match_flags = USB_DEVICE_ID_MATCH_INT_INFO \ | USB_DEVICE_ID_MATCH_DEVICE, \ .idVendor = (vend), \ .idProduct = (prod), \ .bInterfaceClass = (cl), \ .bInterfaceSubClass = (sc), \ .bInterfaceProtocol = (pr) /** * USB_VENDOR_AND_INTERFACE_INFO - describe a specific usb vendor with a class of usb interfaces * @vend: the 16 bit USB Vendor ID * @cl: bInterfaceClass value * @sc: bInterfaceSubClass value * @pr: bInterfaceProtocol value * * This macro is used to create a struct usb_device_id that matches a * specific vendor with a specific class of interfaces. * * This is especially useful when explicitly matching devices that have * vendor specific bDeviceClass values, but standards-compliant interfaces. */ #define USB_VENDOR_AND_INTERFACE_INFO(vend, cl, sc, pr) \ .match_flags = USB_DEVICE_ID_MATCH_INT_INFO \ | USB_DEVICE_ID_MATCH_VENDOR, \ .idVendor = (vend), \ .bInterfaceClass = (cl), \ .bInterfaceSubClass = (sc), \ .bInterfaceProtocol = (pr) /* ----------------------------------------------------------------------- */ /* Stuff for dynamic usb ids */ extern struct mutex usb_dynids_lock; struct usb_dynids { struct list_head list; }; struct usb_dynid { struct list_head node; struct usb_device_id id; }; extern ssize_t usb_store_new_id(struct usb_dynids *dynids, const struct usb_device_id *id_table, struct device_driver *driver, const char *buf, size_t count); extern ssize_t usb_show_dynids(struct usb_dynids *dynids, char *buf); /** * struct usb_driver - identifies USB interface driver to usbcore * @name: The driver name should be unique among USB drivers, * and should normally be the same as the module name. * @probe: Called to see if the driver is willing to manage a particular * interface on a device. If it is, probe returns zero and uses * usb_set_intfdata() to associate driver-specific data with the * interface. It may also use usb_set_interface() to specify the * appropriate altsetting. If unwilling to manage the interface, * return -ENODEV, if genuine IO errors occurred, an appropriate * negative errno value. * @disconnect: Called when the interface is no longer accessible, usually * because its device has been (or is being) disconnected or the * driver module is being unloaded. * @unlocked_ioctl: Used for drivers that want to talk to userspace through * the "usbfs" filesystem. This lets devices provide ways to * expose information to user space regardless of where they * do (or don't) show up otherwise in the filesystem. * @suspend: Called when the device is going to be suspended by the * system either from system sleep or runtime suspend context. The * return value will be ignored in system sleep context, so do NOT * try to continue using the device if suspend fails in this case. * Instead, let the resume or reset-resume routine recover from * the failure. * @resume: Called when the device is being resumed by the system. * @reset_resume: Called when the suspended device has been reset instead * of being resumed. * @pre_reset: Called by usb_reset_device() when the device is about to be * reset. This routine must not return until the driver has no active * URBs for the device, and no more URBs may be submitted until the * post_reset method is called. * @post_reset: Called by usb_reset_device() after the device * has been reset * @shutdown: Called at shut-down time to quiesce the device. * @id_table: USB drivers use ID table to support hotplugging. * Export this with MODULE_DEVICE_TABLE(usb,...). This must be set * or your driver's probe function will never get called. * @dev_groups: Attributes attached to the device that will be created once it * is bound to the driver. * @dynids: used internally to hold the list of dynamically added device * ids for this driver. * @driver: The driver-model core driver structure. * @no_dynamic_id: if set to 1, the USB core will not allow dynamic ids to be * added to this driver by preventing the sysfs file from being created. * @supports_autosuspend: if set to 0, the USB core will not allow autosuspend * for interfaces bound to this driver. * @soft_unbind: if set to 1, the USB core will not kill URBs and disable * endpoints before calling the driver's disconnect method. * @disable_hub_initiated_lpm: if set to 1, the USB core will not allow hubs * to initiate lower power link state transitions when an idle timeout * occurs. Device-initiated USB 3.0 link PM will still be allowed. * * USB interface drivers must provide a name, probe() and disconnect() * methods, and an id_table. Other driver fields are optional. * * The id_table is used in hotplugging. It holds a set of descriptors, * and specialized data may be associated with each entry. That table * is used by both user and kernel mode hotplugging support. * * The probe() and disconnect() methods are called in a context where * they can sleep, but they should avoid abusing the privilege. Most * work to connect to a device should be done when the device is opened, * and undone at the last close. The disconnect code needs to address * concurrency issues with respect to open() and close() methods, as * well as forcing all pending I/O requests to complete (by unlinking * them as necessary, and blocking until the unlinks complete). */ struct usb_driver { const char *name; int (*probe) (struct usb_interface *intf, const struct usb_device_id *id); void (*disconnect) (struct usb_interface *intf); int (*unlocked_ioctl) (struct usb_interface *intf, unsigned int code, void *buf); int (*suspend) (struct usb_interface *intf, pm_message_t message); int (*resume) (struct usb_interface *intf); int (*reset_resume)(struct usb_interface *intf); int (*pre_reset)(struct usb_interface *intf); int (*post_reset)(struct usb_interface *intf); void (*shutdown)(struct usb_interface *intf); const struct usb_device_id *id_table; const struct attribute_group **dev_groups; struct usb_dynids dynids; struct device_driver driver; unsigned int no_dynamic_id:1; unsigned int supports_autosuspend:1; unsigned int disable_hub_initiated_lpm:1; unsigned int soft_unbind:1; }; #define to_usb_driver(d) container_of_const(d, struct usb_driver, driver) /** * struct usb_device_driver - identifies USB device driver to usbcore * @name: The driver name should be unique among USB drivers, * and should normally be the same as the module name. * @match: If set, used for better device/driver matching. * @probe: Called to see if the driver is willing to manage a particular * device. If it is, probe returns zero and uses dev_set_drvdata() * to associate driver-specific data with the device. If unwilling * to manage the device, return a negative errno value. * @disconnect: Called when the device is no longer accessible, usually * because it has been (or is being) disconnected or the driver's * module is being unloaded. * @suspend: Called when the device is going to be suspended by the system. * @resume: Called when the device is being resumed by the system. * @choose_configuration: If non-NULL, called instead of the default * usb_choose_configuration(). If this returns an error then we'll go * on to call the normal usb_choose_configuration(). * @dev_groups: Attributes attached to the device that will be created once it * is bound to the driver. * @driver: The driver-model core driver structure. * @id_table: used with @match() to select better matching driver at * probe() time. * @supports_autosuspend: if set to 0, the USB core will not allow autosuspend * for devices bound to this driver. * @generic_subclass: if set to 1, the generic USB driver's probe, disconnect, * resume and suspend functions will be called in addition to the driver's * own, so this part of the setup does not need to be replicated. * * USB drivers must provide all the fields listed above except driver, * match, and id_table. */ struct usb_device_driver { const char *name; bool (*match) (struct usb_device *udev); int (*probe) (struct usb_device *udev); void (*disconnect) (struct usb_device *udev); int (*suspend) (struct usb_device *udev, pm_message_t message); int (*resume) (struct usb_device *udev, pm_message_t message); int (*choose_configuration) (struct usb_device *udev); const struct attribute_group **dev_groups; struct device_driver driver; const struct usb_device_id *id_table; unsigned int supports_autosuspend:1; unsigned int generic_subclass:1; }; #define to_usb_device_driver(d) container_of_const(d, struct usb_device_driver, driver) /** * struct usb_class_driver - identifies a USB driver that wants to use the USB major number * @name: the usb class device name for this driver. Will show up in sysfs. * @devnode: Callback to provide a naming hint for a possible * device node to create. * @fops: pointer to the struct file_operations of this driver. * @minor_base: the start of the minor range for this driver. * * This structure is used for the usb_register_dev() and * usb_deregister_dev() functions, to consolidate a number of the * parameters used for them. */ struct usb_class_driver { char *name; char *(*devnode)(const struct device *dev, umode_t *mode); const struct file_operations *fops; int minor_base; }; /* * use these in module_init()/module_exit() * and don't forget MODULE_DEVICE_TABLE(usb, ...) */ extern int usb_register_driver(struct usb_driver *, struct module *, const char *); /* use a define to avoid include chaining to get THIS_MODULE & friends */ #define usb_register(driver) \ usb_register_driver(driver, THIS_MODULE, KBUILD_MODNAME) extern void usb_deregister(struct usb_driver *); /** * module_usb_driver() - Helper macro for registering a USB driver * @__usb_driver: usb_driver struct * * Helper macro for USB drivers which do not do anything special in module * init/exit. This eliminates a lot of boilerplate. Each module may only * use this macro once, and calling it replaces module_init() and module_exit() */ #define module_usb_driver(__usb_driver) \ module_driver(__usb_driver, usb_register, \ usb_deregister) extern int usb_register_device_driver(struct usb_device_driver *, struct module *); extern void usb_deregister_device_driver(struct usb_device_driver *); extern int usb_register_dev(struct usb_interface *intf, struct usb_class_driver *class_driver); extern void usb_deregister_dev(struct usb_interface *intf, struct usb_class_driver *class_driver); extern int usb_disabled(void); /* ----------------------------------------------------------------------- */ /* * URB support, for asynchronous request completions */ /* * urb->transfer_flags: * * Note: URB_DIR_IN/OUT is automatically set in usb_submit_urb(). */ #define URB_SHORT_NOT_OK 0x0001 /* report short reads as errors */ #define URB_ISO_ASAP 0x0002 /* iso-only; use the first unexpired * slot in the schedule */ #define URB_NO_TRANSFER_DMA_MAP 0x0004 /* urb->transfer_dma valid on submit */ #define URB_ZERO_PACKET 0x0040 /* Finish bulk OUT with short packet */ #define URB_NO_INTERRUPT 0x0080 /* HINT: no non-error interrupt * needed */ #define URB_FREE_BUFFER 0x0100 /* Free transfer buffer with the URB */ /* The following flags are used internally by usbcore and HCDs */ #define URB_DIR_IN 0x0200 /* Transfer from device to host */ #define URB_DIR_OUT 0 #define URB_DIR_MASK URB_DIR_IN #define URB_DMA_MAP_SINGLE 0x00010000 /* Non-scatter-gather mapping */ #define URB_DMA_MAP_PAGE 0x00020000 /* HCD-unsupported S-G */ #define URB_DMA_MAP_SG 0x00040000 /* HCD-supported S-G */ #define URB_MAP_LOCAL 0x00080000 /* HCD-local-memory mapping */ #define URB_SETUP_MAP_SINGLE 0x00100000 /* Setup packet DMA mapped */ #define URB_SETUP_MAP_LOCAL 0x00200000 /* HCD-local setup packet */ #define URB_DMA_SG_COMBINED 0x00400000 /* S-G entries were combined */ #define URB_ALIGNED_TEMP_BUFFER 0x00800000 /* Temp buffer was alloc'd */ struct usb_iso_packet_descriptor { unsigned int offset; unsigned int length; /* expected length */ unsigned int actual_length; int status; }; struct urb; struct usb_anchor { struct list_head urb_list; wait_queue_head_t wait; spinlock_t lock; atomic_t suspend_wakeups; unsigned int poisoned:1; }; static inline void init_usb_anchor(struct usb_anchor *anchor) { memset(anchor, 0, sizeof(*anchor)); INIT_LIST_HEAD(&anchor->urb_list); init_waitqueue_head(&anchor->wait); spin_lock_init(&anchor->lock); } typedef void (*usb_complete_t)(struct urb *); /** * struct urb - USB Request Block * @urb_list: For use by current owner of the URB. * @anchor_list: membership in the list of an anchor * @anchor: to anchor URBs to a common mooring * @ep: Points to the endpoint's data structure. Will eventually * replace @pipe. * @pipe: Holds endpoint number, direction, type, and more. * Create these values with the eight macros available; * usb_{snd,rcv}TYPEpipe(dev,endpoint), where the TYPE is "ctrl" * (control), "bulk", "int" (interrupt), or "iso" (isochronous). * For example usb_sndbulkpipe() or usb_rcvintpipe(). Endpoint * numbers range from zero to fifteen. Note that "in" endpoint two * is a different endpoint (and pipe) from "out" endpoint two. * The current configuration controls the existence, type, and * maximum packet size of any given endpoint. * @stream_id: the endpoint's stream ID for bulk streams * @dev: Identifies the USB device to perform the request. * @status: This is read in non-iso completion functions to get the * status of the particular request. ISO requests only use it * to tell whether the URB was unlinked; detailed status for * each frame is in the fields of the iso_frame-desc. * @transfer_flags: A variety of flags may be used to affect how URB * submission, unlinking, or operation are handled. Different * kinds of URB can use different flags. * @transfer_buffer: This identifies the buffer to (or from) which the I/O * request will be performed unless URB_NO_TRANSFER_DMA_MAP is set * (however, do not leave garbage in transfer_buffer even then). * This buffer must be suitable for DMA; allocate it with * kmalloc() or equivalent. For transfers to "in" endpoints, contents * of this buffer will be modified. This buffer is used for the data * stage of control transfers. * @transfer_dma: When transfer_flags includes URB_NO_TRANSFER_DMA_MAP, * the device driver is saying that it provided this DMA address, * which the host controller driver should use in preference to the * transfer_buffer. * @sg: scatter gather buffer list, the buffer size of each element in * the list (except the last) must be divisible by the endpoint's * max packet size if no_sg_constraint isn't set in 'struct usb_bus' * @sgt: used to hold a scatter gather table returned by usb_alloc_noncoherent(), * which describes the allocated non-coherent and possibly non-contiguous * memory and is guaranteed to have 1 single DMA mapped segment. The * allocated memory needs to be freed by usb_free_noncoherent(). * @num_mapped_sgs: (internal) number of mapped sg entries * @num_sgs: number of entries in the sg list * @transfer_buffer_length: How big is transfer_buffer. The transfer may * be broken up into chunks according to the current maximum packet * size for the endpoint, which is a function of the configuration * and is encoded in the pipe. When the length is zero, neither * transfer_buffer nor transfer_dma is used. * @actual_length: This is read in non-iso completion functions, and * it tells how many bytes (out of transfer_buffer_length) were * transferred. It will normally be the same as requested, unless * either an error was reported or a short read was performed. * The URB_SHORT_NOT_OK transfer flag may be used to make such * short reads be reported as errors. * @setup_packet: Only used for control transfers, this points to eight bytes * of setup data. Control transfers always start by sending this data * to the device. Then transfer_buffer is read or written, if needed. * @setup_dma: DMA pointer for the setup packet. The caller must not use * this field; setup_packet must point to a valid buffer. * @start_frame: Returns the initial frame for isochronous transfers. * @number_of_packets: Lists the number of ISO transfer buffers. * @interval: Specifies the polling interval for interrupt or isochronous * transfers. The units are frames (milliseconds) for full and low * speed devices, and microframes (1/8 millisecond) for highspeed * and SuperSpeed devices. * @error_count: Returns the number of ISO transfers that reported errors. * @context: For use in completion functions. This normally points to * request-specific driver context. * @complete: Completion handler. This URB is passed as the parameter to the * completion function. The completion function may then do what * it likes with the URB, including resubmitting or freeing it. * @iso_frame_desc: Used to provide arrays of ISO transfer buffers and to * collect the transfer status for each buffer. * * This structure identifies USB transfer requests. URBs must be allocated by * calling usb_alloc_urb() and freed with a call to usb_free_urb(). * Initialization may be done using various usb_fill_*_urb() functions. URBs * are submitted using usb_submit_urb(), and pending requests may be canceled * using usb_unlink_urb() or usb_kill_urb(). * * Data Transfer Buffers: * * Normally drivers provide I/O buffers allocated with kmalloc() or otherwise * taken from the general page pool. That is provided by transfer_buffer * (control requests also use setup_packet), and host controller drivers * perform a dma mapping (and unmapping) for each buffer transferred. Those * mapping operations can be expensive on some platforms (perhaps using a dma * bounce buffer or talking to an IOMMU), * although they're cheap on commodity x86 and ppc hardware. * * Alternatively, drivers may pass the URB_NO_TRANSFER_DMA_MAP transfer flag, * which tells the host controller driver that no such mapping is needed for * the transfer_buffer since * the device driver is DMA-aware. For example, a device driver might * allocate a DMA buffer with usb_alloc_coherent() or call usb_buffer_map(). * When this transfer flag is provided, host controller drivers will * attempt to use the dma address found in the transfer_dma * field rather than determining a dma address themselves. * * Note that transfer_buffer must still be set if the controller * does not support DMA (as indicated by hcd_uses_dma()) and when talking * to root hub. If you have to transfer between highmem zone and the device * on such controller, create a bounce buffer or bail out with an error. * If transfer_buffer cannot be set (is in highmem) and the controller is DMA * capable, assign NULL to it, so that usbmon knows not to use the value. * The setup_packet must always be set, so it cannot be located in highmem. * * Initialization: * * All URBs submitted must initialize the dev, pipe, transfer_flags (may be * zero), and complete fields. All URBs must also initialize * transfer_buffer and transfer_buffer_length. They may provide the * URB_SHORT_NOT_OK transfer flag, indicating that short reads are * to be treated as errors; that flag is invalid for write requests. * * Bulk URBs may * use the URB_ZERO_PACKET transfer flag, indicating that bulk OUT transfers * should always terminate with a short packet, even if it means adding an * extra zero length packet. * * Control URBs must provide a valid pointer in the setup_packet field. * Unlike the transfer_buffer, the setup_packet may not be mapped for DMA * beforehand. * * Interrupt URBs must provide an interval, saying how often (in milliseconds * or, for highspeed devices, 125 microsecond units) * to poll for transfers. After the URB has been submitted, the interval * field reflects how the transfer was actually scheduled. * The polling interval may be more frequent than requested. * For example, some controllers have a maximum interval of 32 milliseconds, * while others support intervals of up to 1024 milliseconds. * Isochronous URBs also have transfer intervals. (Note that for isochronous * endpoints, as well as high speed interrupt endpoints, the encoding of * the transfer interval in the endpoint descriptor is logarithmic. * Device drivers must convert that value to linear units themselves.) * * If an isochronous endpoint queue isn't already running, the host * controller will schedule a new URB to start as soon as bandwidth * utilization allows. If the queue is running then a new URB will be * scheduled to start in the first transfer slot following the end of the * preceding URB, if that slot has not already expired. If the slot has * expired (which can happen when IRQ delivery is delayed for a long time), * the scheduling behavior depends on the URB_ISO_ASAP flag. If the flag * is clear then the URB will be scheduled to start in the expired slot, * implying that some of its packets will not be transferred; if the flag * is set then the URB will be scheduled in the first unexpired slot, * breaking the queue's synchronization. Upon URB completion, the * start_frame field will be set to the (micro)frame number in which the * transfer was scheduled. Ranges for frame counter values are HC-specific * and can go from as low as 256 to as high as 65536 frames. * * Isochronous URBs have a different data transfer model, in part because * the quality of service is only "best effort". Callers provide specially * allocated URBs, with number_of_packets worth of iso_frame_desc structures * at the end. Each such packet is an individual ISO transfer. Isochronous * URBs are normally queued, submitted by drivers to arrange that * transfers are at least double buffered, and then explicitly resubmitted * in completion handlers, so * that data (such as audio or video) streams at as constant a rate as the * host controller scheduler can support. * * Completion Callbacks: * * The completion callback is made in_interrupt(), and one of the first * things that a completion handler should do is check the status field. * The status field is provided for all URBs. It is used to report * unlinked URBs, and status for all non-ISO transfers. It should not * be examined before the URB is returned to the completion handler. * * The context field is normally used to link URBs back to the relevant * driver or request state. * * When the completion callback is invoked for non-isochronous URBs, the * actual_length field tells how many bytes were transferred. This field * is updated even when the URB terminated with an error or was unlinked. * * ISO transfer status is reported in the status and actual_length fields * of the iso_frame_desc array, and the number of errors is reported in * error_count. Completion callbacks for ISO transfers will normally * (re)submit URBs to ensure a constant transfer rate. * * Note that even fields marked "public" should not be touched by the driver * when the urb is owned by the hcd, that is, since the call to * usb_submit_urb() till the entry into the completion routine. */ struct urb { /* private: usb core and host controller only fields in the urb */ struct kref kref; /* reference count of the URB */ int unlinked; /* unlink error code */ void *hcpriv; /* private data for host controller */ atomic_t use_count; /* concurrent submissions counter */ atomic_t reject; /* submissions will fail */ /* public: documented fields in the urb that can be used by drivers */ struct list_head urb_list; /* list head for use by the urb's * current owner */ struct list_head anchor_list; /* the URB may be anchored */ struct usb_anchor *anchor; struct usb_device *dev; /* (in) pointer to associated device */ struct usb_host_endpoint *ep; /* (internal) pointer to endpoint */ unsigned int pipe; /* (in) pipe information */ unsigned int stream_id; /* (in) stream ID */ int status; /* (return) non-ISO status */ unsigned int transfer_flags; /* (in) URB_SHORT_NOT_OK | ...*/ void *transfer_buffer; /* (in) associated data buffer */ dma_addr_t transfer_dma; /* (in) dma addr for transfer_buffer */ struct scatterlist *sg; /* (in) scatter gather buffer list */ struct sg_table *sgt; /* (in) scatter gather table for noncoherent buffer */ int num_mapped_sgs; /* (internal) mapped sg entries */ int num_sgs; /* (in) number of entries in the sg list */ u32 transfer_buffer_length; /* (in) data buffer length */ u32 actual_length; /* (return) actual transfer length */ unsigned char *setup_packet; /* (in) setup packet (control only) */ dma_addr_t setup_dma; /* (in) dma addr for setup_packet */ int start_frame; /* (modify) start frame (ISO) */ int number_of_packets; /* (in) number of ISO packets */ int interval; /* (modify) transfer interval * (INT/ISO) */ int error_count; /* (return) number of ISO errors */ void *context; /* (in) context for completion */ usb_complete_t complete; /* (in) completion routine */ struct usb_iso_packet_descriptor iso_frame_desc[]; /* (in) ISO ONLY */ }; /* ----------------------------------------------------------------------- */ /** * usb_fill_control_urb - initializes a control urb * @urb: pointer to the urb to initialize. * @dev: pointer to the struct usb_device for this urb. * @pipe: the endpoint pipe * @setup_packet: pointer to the setup_packet buffer. The buffer must be * suitable for DMA. * @transfer_buffer: pointer to the transfer buffer. The buffer must be * suitable for DMA. * @buffer_length: length of the transfer buffer * @complete_fn: pointer to the usb_complete_t function * @context: what to set the urb context to. * * Initializes a control urb with the proper information needed to submit * it to a device. * * The transfer buffer and the setup_packet buffer will most likely be filled * or read via DMA. The simplest way to get a buffer that can be DMAed to is * allocating it via kmalloc() or equivalent, even for very small buffers. * If the buffers are embedded in a bigger structure, there is a risk that * the buffer itself, the previous fields and/or the next fields are corrupted * due to cache incoherencies; or slowed down if they are evicted from the * cache. For more information, check &struct urb. * */ static inline void usb_fill_control_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, unsigned char *setup_packet, void *transfer_buffer, int buffer_length, usb_complete_t complete_fn, void *context) { urb->dev = dev; urb->pipe = pipe; urb->setup_packet = setup_packet; urb->transfer_buffer = transfer_buffer; urb->transfer_buffer_length = buffer_length; urb->complete = complete_fn; urb->context = context; } /** * usb_fill_bulk_urb - macro to help initialize a bulk urb * @urb: pointer to the urb to initialize. * @dev: pointer to the struct usb_device for this urb. * @pipe: the endpoint pipe * @transfer_buffer: pointer to the transfer buffer. The buffer must be * suitable for DMA. * @buffer_length: length of the transfer buffer * @complete_fn: pointer to the usb_complete_t function * @context: what to set the urb context to. * * Initializes a bulk urb with the proper information needed to submit it * to a device. * * Refer to usb_fill_control_urb() for a description of the requirements for * transfer_buffer. */ static inline void usb_fill_bulk_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, void *transfer_buffer, int buffer_length, usb_complete_t complete_fn, void *context) { urb->dev = dev; urb->pipe = pipe; urb->transfer_buffer = transfer_buffer; urb->transfer_buffer_length = buffer_length; urb->complete = complete_fn; urb->context = context; } /** * usb_fill_int_urb - macro to help initialize a interrupt urb * @urb: pointer to the urb to initialize. * @dev: pointer to the struct usb_device for this urb. * @pipe: the endpoint pipe * @transfer_buffer: pointer to the transfer buffer. The buffer must be * suitable for DMA. * @buffer_length: length of the transfer buffer * @complete_fn: pointer to the usb_complete_t function * @context: what to set the urb context to. * @interval: what to set the urb interval to, encoded like * the endpoint descriptor's bInterval value. * * Initializes a interrupt urb with the proper information needed to submit * it to a device. * * Refer to usb_fill_control_urb() for a description of the requirements for * transfer_buffer. * * Note that High Speed and SuperSpeed(+) interrupt endpoints use a logarithmic * encoding of the endpoint interval, and express polling intervals in * microframes (eight per millisecond) rather than in frames (one per * millisecond). */ static inline void usb_fill_int_urb(struct urb *urb, struct usb_device *dev, unsigned int pipe, void *transfer_buffer, int buffer_length, usb_complete_t complete_fn, void *context, int interval) { urb->dev = dev; urb->pipe = pipe; urb->transfer_buffer = transfer_buffer; urb->transfer_buffer_length = buffer_length; urb->complete = complete_fn; urb->context = context; if (dev->speed == USB_SPEED_HIGH || dev->speed >= USB_SPEED_SUPER) { /* make sure interval is within allowed range */ interval = clamp(interval, 1, 16); urb->interval = 1 << (interval - 1); } else { urb->interval = interval; } urb->start_frame = -1; } extern void usb_init_urb(struct urb *urb); extern struct urb *usb_alloc_urb(int iso_packets, gfp_t mem_flags); extern void usb_free_urb(struct urb *urb); #define usb_put_urb usb_free_urb extern struct urb *usb_get_urb(struct urb *urb); extern int usb_submit_urb(struct urb *urb, gfp_t mem_flags); extern int usb_unlink_urb(struct urb *urb); extern void usb_kill_urb(struct urb *urb); extern void usb_poison_urb(struct urb *urb); extern void usb_unpoison_urb(struct urb *urb); extern void usb_block_urb(struct urb *urb); extern void usb_kill_anchored_urbs(struct usb_anchor *anchor); extern void usb_poison_anchored_urbs(struct usb_anchor *anchor); extern void usb_unpoison_anchored_urbs(struct usb_anchor *anchor); extern void usb_anchor_suspend_wakeups(struct usb_anchor *anchor); extern void usb_anchor_resume_wakeups(struct usb_anchor *anchor); extern void usb_anchor_urb(struct urb *urb, struct usb_anchor *anchor); extern void usb_unanchor_urb(struct urb *urb); extern int usb_wait_anchor_empty_timeout(struct usb_anchor *anchor, unsigned int timeout); extern struct urb *usb_get_from_anchor(struct usb_anchor *anchor); extern void usb_scuttle_anchored_urbs(struct usb_anchor *anchor); extern int usb_anchor_empty(struct usb_anchor *anchor); #define usb_unblock_urb usb_unpoison_urb /** * usb_urb_dir_in - check if an URB describes an IN transfer * @urb: URB to be checked * * Return: 1 if @urb describes an IN transfer (device-to-host), * otherwise 0. */ static inline int usb_urb_dir_in(struct urb *urb) { return (urb->transfer_flags & URB_DIR_MASK) == URB_DIR_IN; } /** * usb_urb_dir_out - check if an URB describes an OUT transfer * @urb: URB to be checked * * Return: 1 if @urb describes an OUT transfer (host-to-device), * otherwise 0. */ static inline int usb_urb_dir_out(struct urb *urb) { return (urb->transfer_flags & URB_DIR_MASK) == URB_DIR_OUT; } int usb_pipe_type_check(struct usb_device *dev, unsigned int pipe); int usb_urb_ep_type_check(const struct urb *urb); void *usb_alloc_coherent(struct usb_device *dev, size_t size, gfp_t mem_flags, dma_addr_t *dma); void usb_free_coherent(struct usb_device *dev, size_t size, void *addr, dma_addr_t dma); enum dma_data_direction; void *usb_alloc_noncoherent(struct usb_device *dev, size_t size, gfp_t mem_flags, dma_addr_t *dma, enum dma_data_direction dir, struct sg_table **table); void usb_free_noncoherent(struct usb_device *dev, size_t size, void *addr, enum dma_data_direction dir, struct sg_table *table); /*-------------------------------------------------------------------* * SYNCHRONOUS CALL SUPPORT * *-------------------------------------------------------------------*/ extern int usb_control_msg(struct usb_device *dev, unsigned int pipe, __u8 request, __u8 requesttype, __u16 value, __u16 index, void *data, __u16 size, int timeout); extern int usb_interrupt_msg(struct usb_device *usb_dev, unsigned int pipe, void *data, int len, int *actual_length, int timeout); extern int usb_bulk_msg(struct usb_device *usb_dev, unsigned int pipe, void *data, int len, int *actual_length, int timeout); /* wrappers around usb_control_msg() for the most common standard requests */ int usb_control_msg_send(struct usb_device *dev, __u8 endpoint, __u8 request, __u8 requesttype, __u16 value, __u16 index, const void *data, __u16 size, int timeout, gfp_t memflags); int usb_control_msg_recv(struct usb_device *dev, __u8 endpoint, __u8 request, __u8 requesttype, __u16 value, __u16 index, void *data, __u16 size, int timeout, gfp_t memflags); extern int usb_get_descriptor(struct usb_device *dev, unsigned char desctype, unsigned char descindex, void *buf, int size); extern int usb_get_status(struct usb_device *dev, int recip, int type, int target, void *data); static inline int usb_get_std_status(struct usb_device *dev, int recip, int target, void *data) { return usb_get_status(dev, recip, USB_STATUS_TYPE_STANDARD, target, data); } static inline int usb_get_ptm_status(struct usb_device *dev, void *data) { return usb_get_status(dev, USB_RECIP_DEVICE, USB_STATUS_TYPE_PTM, 0, data); } extern int usb_string(struct usb_device *dev, int index, char *buf, size_t size); extern char *usb_cache_string(struct usb_device *udev, int index); /* wrappers that also update important state inside usbcore */ extern int usb_clear_halt(struct usb_device *dev, int pipe); extern int usb_reset_configuration(struct usb_device *dev); extern int usb_set_interface(struct usb_device *dev, int ifnum, int alternate); extern void usb_reset_endpoint(struct usb_device *dev, unsigned int epaddr); /* this request isn't really synchronous, but it belongs with the others */ extern int usb_driver_set_configuration(struct usb_device *udev, int config); /* choose and set configuration for device */ extern int usb_choose_configuration(struct usb_device *udev); extern int usb_set_configuration(struct usb_device *dev, int configuration); /* * timeouts, in milliseconds, used for sending/receiving control messages * they typically complete within a few frames (msec) after they're issued * USB identifies 5 second timeouts, maybe more in a few cases, and a few * slow devices (like some MGE Ellipse UPSes) actually push that limit. */ #define USB_CTRL_GET_TIMEOUT 5000 #define USB_CTRL_SET_TIMEOUT 5000 /** * struct usb_sg_request - support for scatter/gather I/O * @status: zero indicates success, else negative errno * @bytes: counts bytes transferred. * * These requests are initialized using usb_sg_init(), and then are used * as request handles passed to usb_sg_wait() or usb_sg_cancel(). Most * members of the request object aren't for driver access. * * The status and bytecount values are valid only after usb_sg_wait() * returns. If the status is zero, then the bytecount matches the total * from the request. * * After an error completion, drivers may need to clear a halt condition * on the endpoint. */ struct usb_sg_request { int status; size_t bytes; /* private: * members below are private to usbcore, * and are not provided for driver access! */ spinlock_t lock; struct usb_device *dev; int pipe; int entries; struct urb **urbs; int count; struct completion complete; }; int usb_sg_init( struct usb_sg_request *io, struct usb_device *dev, unsigned pipe, unsigned period, struct scatterlist *sg, int nents, size_t length, gfp_t mem_flags ); void usb_sg_cancel(struct usb_sg_request *io); void usb_sg_wait(struct usb_sg_request *io); /* ----------------------------------------------------------------------- */ /* * For various legacy reasons, Linux has a small cookie that's paired with * a struct usb_device to identify an endpoint queue. Queue characteristics * are defined by the endpoint's descriptor. This cookie is called a "pipe", * an unsigned int encoded as: * * - direction: bit 7 (0 = Host-to-Device [Out], * 1 = Device-to-Host [In] ... * like endpoint bEndpointAddress) * - device address: bits 8-14 ... bit positions known to uhci-hcd * - endpoint: bits 15-18 ... bit positions known to uhci-hcd * - pipe type: bits 30-31 (00 = isochronous, 01 = interrupt, * 10 = control, 11 = bulk) * * Given the device address and endpoint descriptor, pipes are redundant. */ /* NOTE: these are not the standard USB_ENDPOINT_XFER_* values!! */ /* (yet ... they're the values used by usbfs) */ #define PIPE_ISOCHRONOUS 0 #define PIPE_INTERRUPT 1 #define PIPE_CONTROL 2 #define PIPE_BULK 3 #define usb_pipein(pipe) ((pipe) & USB_DIR_IN) #define usb_pipeout(pipe) (!usb_pipein(pipe)) #define usb_pipedevice(pipe) (((pipe) >> 8) & 0x7f) #define usb_pipeendpoint(pipe) (((pipe) >> 15) & 0xf) #define usb_pipetype(pipe) (((pipe) >> 30) & 3) #define usb_pipeisoc(pipe) (usb_pipetype((pipe)) == PIPE_ISOCHRONOUS) #define usb_pipeint(pipe) (usb_pipetype((pipe)) == PIPE_INTERRUPT) #define usb_pipecontrol(pipe) (usb_pipetype((pipe)) == PIPE_CONTROL) #define usb_pipebulk(pipe) (usb_pipetype((pipe)) == PIPE_BULK) static inline unsigned int __create_pipe(struct usb_device *dev, unsigned int endpoint) { return (dev->devnum << 8) | (endpoint << 15); } /* Create various pipes... */ #define usb_sndctrlpipe(dev, endpoint) \ ((PIPE_CONTROL << 30) | __create_pipe(dev, endpoint)) #define usb_rcvctrlpipe(dev, endpoint) \ ((PIPE_CONTROL << 30) | __create_pipe(dev, endpoint) | USB_DIR_IN) #define usb_sndisocpipe(dev, endpoint) \ ((PIPE_ISOCHRONOUS << 30) | __create_pipe(dev, endpoint)) #define usb_rcvisocpipe(dev, endpoint) \ ((PIPE_ISOCHRONOUS << 30) | __create_pipe(dev, endpoint) | USB_DIR_IN) #define usb_sndbulkpipe(dev, endpoint) \ ((PIPE_BULK << 30) | __create_pipe(dev, endpoint)) #define usb_rcvbulkpipe(dev, endpoint) \ ((PIPE_BULK << 30) | __create_pipe(dev, endpoint) | USB_DIR_IN) #define usb_sndintpipe(dev, endpoint) \ ((PIPE_INTERRUPT << 30) | __create_pipe(dev, endpoint)) #define usb_rcvintpipe(dev, endpoint) \ ((PIPE_INTERRUPT << 30) | __create_pipe(dev, endpoint) | USB_DIR_IN) static inline struct usb_host_endpoint * usb_pipe_endpoint(struct usb_device *dev, unsigned int pipe) { struct usb_host_endpoint **eps; eps = usb_pipein(pipe) ? dev->ep_in : dev->ep_out; return eps[usb_pipeendpoint(pipe)]; } static inline u16 usb_maxpacket(struct usb_device *udev, int pipe) { struct usb_host_endpoint *ep = usb_pipe_endpoint(udev, pipe); if (!ep) return 0; /* NOTE: only 0x07ff bits are for packet size... */ return usb_endpoint_maxp(&ep->desc); } u32 usb_endpoint_max_periodic_payload(struct usb_device *udev, const struct usb_host_endpoint *ep); bool usb_endpoint_is_hs_isoc_double(struct usb_device *udev, const struct usb_host_endpoint *ep); /* translate USB error codes to codes user space understands */ static inline int usb_translate_errors(int error_code) { switch (error_code) { case 0: case -ENOMEM: case -ENODEV: case -EOPNOTSUPP: return error_code; default: return -EIO; } } /* Events from the usb core */ #define USB_DEVICE_ADD 0x0001 #define USB_DEVICE_REMOVE 0x0002 #define USB_BUS_ADD 0x0003 #define USB_BUS_REMOVE 0x0004 extern void usb_register_notify(struct notifier_block *nb); extern void usb_unregister_notify(struct notifier_block *nb); /* debugfs stuff */ extern struct dentry *usb_debug_root; /* LED triggers */ enum usb_led_event { USB_LED_EVENT_HOST = 0, USB_LED_EVENT_GADGET = 1, }; #ifdef CONFIG_USB_LED_TRIG extern void usb_led_activity(enum usb_led_event ev); #else static inline void usb_led_activity(enum usb_led_event ev) {} #endif #endif /* __KERNEL__ */ #endif |
| 1476 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __LINUX_COMPLETION_H #define __LINUX_COMPLETION_H /* * (C) Copyright 2001 Linus Torvalds * * Atomic wait-for-completion handler data structures. * See kernel/sched/completion.c for details. */ #include <linux/swait.h> /* * struct completion - structure used to maintain state for a "completion" * * This is the opaque structure used to maintain the state for a "completion". * Completions currently use a FIFO to queue threads that have to wait for * the "completion" event. * * See also: complete(), wait_for_completion() (and friends _timeout, * _interruptible, _interruptible_timeout, and _killable), init_completion(), * reinit_completion(), and macros DECLARE_COMPLETION(), * DECLARE_COMPLETION_ONSTACK(). */ struct completion { unsigned int done; struct swait_queue_head wait; }; #define init_completion_map(x, m) init_completion(x) static inline void complete_acquire(struct completion *x) {} static inline void complete_release(struct completion *x) {} #define COMPLETION_INITIALIZER(work) \ { 0, __SWAIT_QUEUE_HEAD_INITIALIZER((work).wait) } #define COMPLETION_INITIALIZER_ONSTACK_MAP(work, map) \ (*({ init_completion_map(&(work), &(map)); &(work); })) #define COMPLETION_INITIALIZER_ONSTACK(work) \ (*({ init_completion(&work); &work; })) /** * DECLARE_COMPLETION - declare and initialize a completion structure * @work: identifier for the completion structure * * This macro declares and initializes a completion structure. Generally used * for static declarations. You should use the _ONSTACK variant for automatic * variables. */ #define DECLARE_COMPLETION(work) \ struct completion work = COMPLETION_INITIALIZER(work) /* * Lockdep needs to run a non-constant initializer for on-stack * completions - so we use the _ONSTACK() variant for those that * are on the kernel stack: */ /** * DECLARE_COMPLETION_ONSTACK - declare and initialize a completion structure * @work: identifier for the completion structure * * This macro declares and initializes a completion structure on the kernel * stack. */ #ifdef CONFIG_LOCKDEP # define DECLARE_COMPLETION_ONSTACK(work) \ struct completion work = COMPLETION_INITIALIZER_ONSTACK(work) # define DECLARE_COMPLETION_ONSTACK_MAP(work, map) \ struct completion work = COMPLETION_INITIALIZER_ONSTACK_MAP(work, map) #else # define DECLARE_COMPLETION_ONSTACK(work) DECLARE_COMPLETION(work) # define DECLARE_COMPLETION_ONSTACK_MAP(work, map) DECLARE_COMPLETION(work) #endif /** * init_completion - Initialize a dynamically allocated completion * @x: pointer to completion structure that is to be initialized * * This inline function will initialize a dynamically created completion * structure. */ static inline void init_completion(struct completion *x) { x->done = 0; init_swait_queue_head(&x->wait); } /** * reinit_completion - reinitialize a completion structure * @x: pointer to completion structure that is to be reinitialized * * This inline function should be used to reinitialize a completion structure so it can * be reused. This is especially important after complete_all() is used. */ static inline void reinit_completion(struct completion *x) { x->done = 0; } extern void wait_for_completion(struct completion *); extern void wait_for_completion_io(struct completion *); extern int wait_for_completion_interruptible(struct completion *x); extern int wait_for_completion_killable(struct completion *x); extern int wait_for_completion_state(struct completion *x, unsigned int state); extern unsigned long wait_for_completion_timeout(struct completion *x, unsigned long timeout); extern unsigned long wait_for_completion_io_timeout(struct completion *x, unsigned long timeout); extern long wait_for_completion_interruptible_timeout( struct completion *x, unsigned long timeout); extern long wait_for_completion_killable_timeout( struct completion *x, unsigned long timeout); extern bool try_wait_for_completion(struct completion *x); extern bool completion_done(struct completion *x); extern void complete(struct completion *); extern void complete_on_current_cpu(struct completion *x); extern void complete_all(struct completion *); #endif |
| 18 19 11 10 7 11 4 7 11 11 11 11 11 11 11 7 11 4 7 11 11 7 7 7 6 7 7 7 7 7 7 3 3 4 7 11 11 11 4 4 7 7 7 7 7 7 7 7 7 4 4 4 4 11 11 11 11 11 11 11 11 7 11 7 3 7 3 7 3 3 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 7 11 11 11 11 11 11 11 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 8 19 67 4 4 4 4 4 7 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 13 13 13 13 14 21 4 3 8 14 7 21 17 18 18 18 18 11 18 10 18 17 18 11 18 18 17 17 14 14 7 7 18 18 18 18 11 11 11 18 18 1 1 1 2 2 2 1 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 | // SPDX-License-Identifier: GPL-2.0-only #include <linux/kernel.h> #include <linux/netdevice.h> #include <linux/rtnetlink.h> #include <linux/slab.h> #include <net/switchdev.h> #include "br_private.h" #include "br_private_tunnel.h" static void nbp_vlan_set_vlan_dev_state(struct net_bridge_port *p, u16 vid); static inline int br_vlan_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct net_bridge_vlan *vle = ptr; u16 vid = *(u16 *)arg->key; return vle->vid != vid; } static const struct rhashtable_params br_vlan_rht_params = { .head_offset = offsetof(struct net_bridge_vlan, vnode), .key_offset = offsetof(struct net_bridge_vlan, vid), .key_len = sizeof(u16), .nelem_hint = 3, .max_size = VLAN_N_VID, .obj_cmpfn = br_vlan_cmp, .automatic_shrinking = true, }; static struct net_bridge_vlan *br_vlan_lookup(struct rhashtable *tbl, u16 vid) { return rhashtable_lookup_fast(tbl, &vid, br_vlan_rht_params); } static void __vlan_add_pvid(struct net_bridge_vlan_group *vg, const struct net_bridge_vlan *v) { if (vg->pvid == v->vid) return; smp_wmb(); br_vlan_set_pvid_state(vg, v->state); vg->pvid = v->vid; } static void __vlan_delete_pvid(struct net_bridge_vlan_group *vg, u16 vid) { if (vg->pvid != vid) return; smp_wmb(); vg->pvid = 0; } /* Update the BRIDGE_VLAN_INFO_PVID and BRIDGE_VLAN_INFO_UNTAGGED flags of @v. * If @commit is false, return just whether the BRIDGE_VLAN_INFO_PVID and * BRIDGE_VLAN_INFO_UNTAGGED bits of @flags would produce any change onto @v. */ static bool __vlan_flags_update(struct net_bridge_vlan *v, u16 flags, bool commit) { struct net_bridge_vlan_group *vg; bool change; if (br_vlan_is_master(v)) vg = br_vlan_group(v->br); else vg = nbp_vlan_group(v->port); /* check if anything would be changed on commit */ change = !!(flags & BRIDGE_VLAN_INFO_PVID) == !!(vg->pvid != v->vid) || ((flags ^ v->flags) & BRIDGE_VLAN_INFO_UNTAGGED); if (!commit) goto out; if (flags & BRIDGE_VLAN_INFO_PVID) __vlan_add_pvid(vg, v); else __vlan_delete_pvid(vg, v->vid); if (flags & BRIDGE_VLAN_INFO_UNTAGGED) v->flags |= BRIDGE_VLAN_INFO_UNTAGGED; else v->flags &= ~BRIDGE_VLAN_INFO_UNTAGGED; out: return change; } static bool __vlan_flags_would_change(struct net_bridge_vlan *v, u16 flags) { return __vlan_flags_update(v, flags, false); } static void __vlan_flags_commit(struct net_bridge_vlan *v, u16 flags) { __vlan_flags_update(v, flags, true); } static int __vlan_vid_add(struct net_device *dev, struct net_bridge *br, struct net_bridge_vlan *v, u16 flags, struct netlink_ext_ack *extack) { int err; /* Try switchdev op first. In case it is not supported, fallback to * 8021q add. */ err = br_switchdev_port_vlan_add(dev, v->vid, flags, false, extack); if (err == -EOPNOTSUPP) return vlan_vid_add(dev, br->vlan_proto, v->vid); v->priv_flags |= BR_VLFLAG_ADDED_BY_SWITCHDEV; return err; } static void __vlan_add_list(struct net_bridge_vlan *v) { struct net_bridge_vlan_group *vg; struct list_head *headp, *hpos; struct net_bridge_vlan *vent; if (br_vlan_is_master(v)) vg = br_vlan_group(v->br); else vg = nbp_vlan_group(v->port); headp = &vg->vlan_list; list_for_each_prev(hpos, headp) { vent = list_entry(hpos, struct net_bridge_vlan, vlist); if (v->vid >= vent->vid) break; } list_add_rcu(&v->vlist, hpos); } static void __vlan_del_list(struct net_bridge_vlan *v) { list_del_rcu(&v->vlist); } static int __vlan_vid_del(struct net_device *dev, struct net_bridge *br, const struct net_bridge_vlan *v) { int err; /* Try switchdev op first. In case it is not supported, fallback to * 8021q del. */ err = br_switchdev_port_vlan_del(dev, v->vid); if (!(v->priv_flags & BR_VLFLAG_ADDED_BY_SWITCHDEV)) vlan_vid_del(dev, br->vlan_proto, v->vid); return err == -EOPNOTSUPP ? 0 : err; } /* Returns a master vlan, if it didn't exist it gets created. In all cases * a reference is taken to the master vlan before returning. */ static struct net_bridge_vlan * br_vlan_get_master(struct net_bridge *br, u16 vid, struct netlink_ext_ack *extack) { struct net_bridge_vlan_group *vg; struct net_bridge_vlan *masterv; vg = br_vlan_group(br); masterv = br_vlan_find(vg, vid); if (!masterv) { bool changed; /* missing global ctx, create it now */ if (br_vlan_add(br, vid, 0, &changed, extack)) return NULL; masterv = br_vlan_find(vg, vid); if (WARN_ON(!masterv)) return NULL; refcount_set(&masterv->refcnt, 1); return masterv; } refcount_inc(&masterv->refcnt); return masterv; } static void br_master_vlan_rcu_free(struct rcu_head *rcu) { struct net_bridge_vlan *v; v = container_of(rcu, struct net_bridge_vlan, rcu); WARN_ON(!br_vlan_is_master(v)); free_percpu(v->stats); v->stats = NULL; kfree(v); } static void br_vlan_put_master(struct net_bridge_vlan *masterv) { struct net_bridge_vlan_group *vg; if (!br_vlan_is_master(masterv)) return; vg = br_vlan_group(masterv->br); if (refcount_dec_and_test(&masterv->refcnt)) { rhashtable_remove_fast(&vg->vlan_hash, &masterv->vnode, br_vlan_rht_params); __vlan_del_list(masterv); br_multicast_toggle_one_vlan(masterv, false); br_multicast_ctx_deinit(&masterv->br_mcast_ctx); call_rcu(&masterv->rcu, br_master_vlan_rcu_free); } } static void nbp_vlan_rcu_free(struct rcu_head *rcu) { struct net_bridge_vlan *v; v = container_of(rcu, struct net_bridge_vlan, rcu); WARN_ON(br_vlan_is_master(v)); /* if we had per-port stats configured then free them here */ if (v->priv_flags & BR_VLFLAG_PER_PORT_STATS) free_percpu(v->stats); v->stats = NULL; kfree(v); } static void br_vlan_init_state(struct net_bridge_vlan *v) { struct net_bridge *br; if (br_vlan_is_master(v)) br = v->br; else br = v->port->br; if (br_opt_get(br, BROPT_MST_ENABLED)) { br_mst_vlan_init_state(v); return; } v->state = BR_STATE_FORWARDING; v->msti = 0; } /* This is the shared VLAN add function which works for both ports and bridge * devices. There are four possible calls to this function in terms of the * vlan entry type: * 1. vlan is being added on a port (no master flags, global entry exists) * 2. vlan is being added on a bridge (both master and brentry flags) * 3. vlan is being added on a port, but a global entry didn't exist which * is being created right now (master flag set, brentry flag unset), the * global entry is used for global per-vlan features, but not for filtering * 4. same as 3 but with both master and brentry flags set so the entry * will be used for filtering in both the port and the bridge */ static int __vlan_add(struct net_bridge_vlan *v, u16 flags, struct netlink_ext_ack *extack) { struct net_bridge_vlan *masterv = NULL; struct net_bridge_port *p = NULL; struct net_bridge_vlan_group *vg; struct net_device *dev; struct net_bridge *br; int err; if (br_vlan_is_master(v)) { br = v->br; dev = br->dev; vg = br_vlan_group(br); } else { p = v->port; br = p->br; dev = p->dev; vg = nbp_vlan_group(p); } if (p) { /* Add VLAN to the device filter if it is supported. * This ensures tagged traffic enters the bridge when * promiscuous mode is disabled by br_manage_promisc(). */ err = __vlan_vid_add(dev, br, v, flags, extack); if (err) goto out; /* need to work on the master vlan too */ if (flags & BRIDGE_VLAN_INFO_MASTER) { bool changed; err = br_vlan_add(br, v->vid, flags | BRIDGE_VLAN_INFO_BRENTRY, &changed, extack); if (err) goto out_filt; if (changed) br_vlan_notify(br, NULL, v->vid, 0, RTM_NEWVLAN); } masterv = br_vlan_get_master(br, v->vid, extack); if (!masterv) { err = -ENOMEM; goto out_filt; } v->brvlan = masterv; if (br_opt_get(br, BROPT_VLAN_STATS_PER_PORT)) { v->stats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats); if (!v->stats) { err = -ENOMEM; goto out_filt; } v->priv_flags |= BR_VLFLAG_PER_PORT_STATS; } else { v->stats = masterv->stats; } br_multicast_port_ctx_init(p, v, &v->port_mcast_ctx); } else { if (br_vlan_should_use(v)) { err = br_switchdev_port_vlan_add(dev, v->vid, flags, false, extack); if (err && err != -EOPNOTSUPP) goto out; } br_multicast_ctx_init(br, v, &v->br_mcast_ctx); v->priv_flags |= BR_VLFLAG_GLOBAL_MCAST_ENABLED; } /* Add the dev mac and count the vlan only if it's usable */ if (br_vlan_should_use(v)) { if (!br_opt_get(br, BROPT_FDB_LOCAL_VLAN_0)) { err = br_fdb_add_local(br, p, dev->dev_addr, v->vid); if (err) { br_err(br, "failed insert local address into bridge forwarding table\n"); goto out_filt; } } vg->num_vlans++; } /* set the state before publishing */ br_vlan_init_state(v); err = rhashtable_lookup_insert_fast(&vg->vlan_hash, &v->vnode, br_vlan_rht_params); if (err) goto out_fdb_insert; __vlan_add_list(v); __vlan_flags_commit(v, flags); br_multicast_toggle_one_vlan(v, true); if (p) nbp_vlan_set_vlan_dev_state(p, v->vid); out: return err; out_fdb_insert: if (br_vlan_should_use(v)) { br_fdb_find_delete_local(br, p, dev->dev_addr, v->vid); vg->num_vlans--; } out_filt: if (p) { __vlan_vid_del(dev, br, v); if (masterv) { if (v->stats && masterv->stats != v->stats) free_percpu(v->stats); v->stats = NULL; br_vlan_put_master(masterv); v->brvlan = NULL; } } else { br_switchdev_port_vlan_del(dev, v->vid); } goto out; } static int __vlan_del(struct net_bridge_vlan *v) { struct net_bridge_vlan *masterv = v; struct net_bridge_vlan_group *vg; struct net_bridge_port *p = NULL; int err = 0; if (br_vlan_is_master(v)) { vg = br_vlan_group(v->br); } else { p = v->port; vg = nbp_vlan_group(v->port); masterv = v->brvlan; } __vlan_delete_pvid(vg, v->vid); if (p) { err = __vlan_vid_del(p->dev, p->br, v); if (err) goto out; } else { err = br_switchdev_port_vlan_del(v->br->dev, v->vid); if (err && err != -EOPNOTSUPP) goto out; err = 0; } if (br_vlan_should_use(v)) { v->flags &= ~BRIDGE_VLAN_INFO_BRENTRY; vg->num_vlans--; } if (masterv != v) { vlan_tunnel_info_del(vg, v); rhashtable_remove_fast(&vg->vlan_hash, &v->vnode, br_vlan_rht_params); __vlan_del_list(v); nbp_vlan_set_vlan_dev_state(p, v->vid); br_multicast_toggle_one_vlan(v, false); br_multicast_port_ctx_deinit(&v->port_mcast_ctx); call_rcu(&v->rcu, nbp_vlan_rcu_free); } br_vlan_put_master(masterv); out: return err; } static void __vlan_group_free(struct net_bridge_vlan_group *vg) { WARN_ON(!list_empty(&vg->vlan_list)); rhashtable_destroy(&vg->vlan_hash); vlan_tunnel_deinit(vg); kfree(vg); } static void __vlan_flush(const struct net_bridge *br, const struct net_bridge_port *p, struct net_bridge_vlan_group *vg) { struct net_bridge_vlan *vlan, *tmp; u16 v_start = 0, v_end = 0; int err; __vlan_delete_pvid(vg, vg->pvid); list_for_each_entry_safe(vlan, tmp, &vg->vlan_list, vlist) { /* take care of disjoint ranges */ if (!v_start) { v_start = vlan->vid; } else if (vlan->vid - v_end != 1) { /* found range end, notify and start next one */ br_vlan_notify(br, p, v_start, v_end, RTM_DELVLAN); v_start = vlan->vid; } v_end = vlan->vid; err = __vlan_del(vlan); if (err) { br_err(br, "port %u(%s) failed to delete vlan %d: %pe\n", (unsigned int) p->port_no, p->dev->name, vlan->vid, ERR_PTR(err)); } } /* notify about the last/whole vlan range */ if (v_start) br_vlan_notify(br, p, v_start, v_end, RTM_DELVLAN); } struct sk_buff *br_handle_vlan(struct net_bridge *br, const struct net_bridge_port *p, struct net_bridge_vlan_group *vg, struct sk_buff *skb) { struct pcpu_sw_netstats *stats; struct net_bridge_vlan *v; u16 vid; /* If this packet was not filtered at input, let it pass */ if (!BR_INPUT_SKB_CB(skb)->vlan_filtered) goto out; /* At this point, we know that the frame was filtered and contains * a valid vlan id. If the vlan id has untagged flag set, * send untagged; otherwise, send tagged. */ br_vlan_get_tag(skb, &vid); v = br_vlan_find(vg, vid); /* Vlan entry must be configured at this point. The * only exception is the bridge is set in promisc mode and the * packet is destined for the bridge device. In this case * pass the packet as is. */ if (!v || !br_vlan_should_use(v)) { if ((br->dev->flags & IFF_PROMISC) && skb->dev == br->dev) { goto out; } else { kfree_skb(skb); return NULL; } } if (br_opt_get(br, BROPT_VLAN_STATS_ENABLED)) { stats = this_cpu_ptr(v->stats); u64_stats_update_begin(&stats->syncp); u64_stats_add(&stats->tx_bytes, skb->len); u64_stats_inc(&stats->tx_packets); u64_stats_update_end(&stats->syncp); } /* If the skb will be sent using forwarding offload, the assumption is * that the switchdev will inject the packet into hardware together * with the bridge VLAN, so that it can be forwarded according to that * VLAN. The switchdev should deal with popping the VLAN header in * hardware on each egress port as appropriate. So only strip the VLAN * header if forwarding offload is not being used. */ if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED && !br_switchdev_frame_uses_tx_fwd_offload(skb)) __vlan_hwaccel_clear_tag(skb); if (p && (p->flags & BR_VLAN_TUNNEL) && br_handle_egress_vlan_tunnel(skb, v)) { kfree_skb(skb); return NULL; } out: return skb; } /* Called under RCU */ static bool __allowed_ingress(const struct net_bridge *br, struct net_bridge_vlan_group *vg, struct sk_buff *skb, u16 *vid, u8 *state, struct net_bridge_vlan **vlan) { struct pcpu_sw_netstats *stats; struct net_bridge_vlan *v; bool tagged; BR_INPUT_SKB_CB(skb)->vlan_filtered = true; /* If vlan tx offload is disabled on bridge device and frame was * sent from vlan device on the bridge device, it does not have * HW accelerated vlan tag. */ if (unlikely(!skb_vlan_tag_present(skb) && skb->protocol == br->vlan_proto)) { skb = skb_vlan_untag(skb); if (unlikely(!skb)) return false; } if (!br_vlan_get_tag(skb, vid)) { /* Tagged frame */ if (skb->vlan_proto != br->vlan_proto) { /* Protocol-mismatch, empty out vlan_tci for new tag */ skb_push(skb, ETH_HLEN); skb = vlan_insert_tag_set_proto(skb, skb->vlan_proto, skb_vlan_tag_get(skb)); if (unlikely(!skb)) return false; skb_pull(skb, ETH_HLEN); skb_reset_mac_len(skb); *vid = 0; tagged = false; } else { tagged = true; } } else { /* Untagged frame */ tagged = false; } if (!*vid) { u16 pvid = br_get_pvid(vg); /* Frame had a tag with VID 0 or did not have a tag. * See if pvid is set on this port. That tells us which * vlan untagged or priority-tagged traffic belongs to. */ if (!pvid) goto drop; /* PVID is set on this port. Any untagged or priority-tagged * ingress frame is considered to belong to this vlan. */ *vid = pvid; if (likely(!tagged)) /* Untagged Frame. */ __vlan_hwaccel_put_tag(skb, br->vlan_proto, pvid); else /* Priority-tagged Frame. * At this point, we know that skb->vlan_tci VID * field was 0. * We update only VID field and preserve PCP field. */ skb->vlan_tci |= pvid; /* if snooping and stats are disabled we can avoid the lookup */ if (!br_opt_get(br, BROPT_MCAST_VLAN_SNOOPING_ENABLED) && !br_opt_get(br, BROPT_VLAN_STATS_ENABLED)) { if (*state == BR_STATE_FORWARDING) { *state = br_vlan_get_pvid_state(vg); if (!br_vlan_state_allowed(*state, true)) goto drop; } return true; } } v = br_vlan_find(vg, *vid); if (!v || !br_vlan_should_use(v)) goto drop; if (*state == BR_STATE_FORWARDING) { *state = br_vlan_get_state(v); if (!br_vlan_state_allowed(*state, true)) goto drop; } if (br_opt_get(br, BROPT_VLAN_STATS_ENABLED)) { stats = this_cpu_ptr(v->stats); u64_stats_update_begin(&stats->syncp); u64_stats_add(&stats->rx_bytes, skb->len); u64_stats_inc(&stats->rx_packets); u64_stats_update_end(&stats->syncp); } *vlan = v; return true; drop: kfree_skb(skb); return false; } bool br_allowed_ingress(const struct net_bridge *br, struct net_bridge_vlan_group *vg, struct sk_buff *skb, u16 *vid, u8 *state, struct net_bridge_vlan **vlan) { /* If VLAN filtering is disabled on the bridge, all packets are * permitted. */ *vlan = NULL; if (!br_opt_get(br, BROPT_VLAN_ENABLED)) { BR_INPUT_SKB_CB(skb)->vlan_filtered = false; return true; } return __allowed_ingress(br, vg, skb, vid, state, vlan); } /* Called under RCU. */ bool br_allowed_egress(struct net_bridge_vlan_group *vg, const struct sk_buff *skb) { const struct net_bridge_vlan *v; u16 vid; /* If this packet was not filtered at input, let it pass */ if (!BR_INPUT_SKB_CB(skb)->vlan_filtered) return true; br_vlan_get_tag(skb, &vid); v = br_vlan_find(vg, vid); if (v && br_vlan_should_use(v) && br_vlan_state_allowed(br_vlan_get_state(v), false)) return true; return false; } /* Called under RCU */ bool br_should_learn(struct net_bridge_port *p, struct sk_buff *skb, u16 *vid) { struct net_bridge_vlan_group *vg; struct net_bridge *br = p->br; struct net_bridge_vlan *v; /* If filtering was disabled at input, let it pass. */ if (!br_opt_get(br, BROPT_VLAN_ENABLED)) return true; vg = nbp_vlan_group_rcu(p); if (!vg || !vg->num_vlans) return false; if (!br_vlan_get_tag(skb, vid) && skb->vlan_proto != br->vlan_proto) *vid = 0; if (!*vid) { *vid = br_get_pvid(vg); if (!*vid || !br_vlan_state_allowed(br_vlan_get_pvid_state(vg), true)) return false; return true; } v = br_vlan_find(vg, *vid); if (v && br_vlan_state_allowed(br_vlan_get_state(v), true)) return true; return false; } static int br_vlan_add_existing(struct net_bridge *br, struct net_bridge_vlan_group *vg, struct net_bridge_vlan *vlan, u16 flags, bool *changed, struct netlink_ext_ack *extack) { bool becomes_brentry = false; bool would_change = false; int err; if (!br_vlan_is_brentry(vlan)) { /* Trying to change flags of non-existent bridge vlan */ if (!(flags & BRIDGE_VLAN_INFO_BRENTRY)) return -EINVAL; becomes_brentry = true; } else { would_change = __vlan_flags_would_change(vlan, flags); } /* Master VLANs that aren't brentries weren't notified before, * time to notify them now. */ if (becomes_brentry || would_change) { err = br_switchdev_port_vlan_add(br->dev, vlan->vid, flags, would_change, extack); if (err && err != -EOPNOTSUPP) return err; } if (becomes_brentry) { /* It was only kept for port vlans, now make it real */ err = br_fdb_add_local(br, NULL, br->dev->dev_addr, vlan->vid); if (err) { br_err(br, "failed to insert local address into bridge forwarding table\n"); goto err_fdb_insert; } refcount_inc(&vlan->refcnt); vlan->flags |= BRIDGE_VLAN_INFO_BRENTRY; vg->num_vlans++; *changed = true; br_multicast_toggle_one_vlan(vlan, true); } __vlan_flags_commit(vlan, flags); if (would_change) *changed = true; return 0; err_fdb_insert: br_switchdev_port_vlan_del(br->dev, vlan->vid); return err; } /* Must be protected by RTNL. * Must be called with vid in range from 1 to 4094 inclusive. * changed must be true only if the vlan was created or updated */ int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags, bool *changed, struct netlink_ext_ack *extack) { struct net_bridge_vlan_group *vg; struct net_bridge_vlan *vlan; int ret; ASSERT_RTNL(); *changed = false; vg = br_vlan_group(br); vlan = br_vlan_find(vg, vid); if (vlan) return br_vlan_add_existing(br, vg, vlan, flags, changed, extack); vlan = kzalloc(sizeof(*vlan), GFP_KERNEL); if (!vlan) return -ENOMEM; vlan->stats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats); if (!vlan->stats) { kfree(vlan); return -ENOMEM; } vlan->vid = vid; vlan->flags = flags | BRIDGE_VLAN_INFO_MASTER; vlan->flags &= ~BRIDGE_VLAN_INFO_PVID; vlan->br = br; if (flags & BRIDGE_VLAN_INFO_BRENTRY) refcount_set(&vlan->refcnt, 1); ret = __vlan_add(vlan, flags, extack); if (ret) { free_percpu(vlan->stats); kfree(vlan); } else { *changed = true; } return ret; } /* Must be protected by RTNL. * Must be called with vid in range from 1 to 4094 inclusive. */ int br_vlan_delete(struct net_bridge *br, u16 vid) { struct net_bridge_vlan_group *vg; struct net_bridge_vlan *v; ASSERT_RTNL(); vg = br_vlan_group(br); v = br_vlan_find(vg, vid); if (!v || !br_vlan_is_brentry(v)) return -ENOENT; br_fdb_find_delete_local(br, NULL, br->dev->dev_addr, vid); br_fdb_delete_by_port(br, NULL, vid, 0); vlan_tunnel_info_del(vg, v); return __vlan_del(v); } void br_vlan_flush(struct net_bridge *br) { struct net_bridge_vlan_group *vg; ASSERT_RTNL(); vg = br_vlan_group(br); __vlan_flush(br, NULL, vg); RCU_INIT_POINTER(br->vlgrp, NULL); synchronize_net(); __vlan_group_free(vg); } struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 vid) { if (!vg) return NULL; return br_vlan_lookup(&vg->vlan_hash, vid); } /* Must be protected by RTNL. */ static void recalculate_group_addr(struct net_bridge *br) { if (br_opt_get(br, BROPT_GROUP_ADDR_SET)) return; spin_lock_bh(&br->lock); if (!br_opt_get(br, BROPT_VLAN_ENABLED) || br->vlan_proto == htons(ETH_P_8021Q)) { /* Bridge Group Address */ br->group_addr[5] = 0x00; } else { /* vlan_enabled && ETH_P_8021AD */ /* Provider Bridge Group Address */ br->group_addr[5] = 0x08; } spin_unlock_bh(&br->lock); } /* Must be protected by RTNL. */ void br_recalculate_fwd_mask(struct net_bridge *br) { if (!br_opt_get(br, BROPT_VLAN_ENABLED) || br->vlan_proto == htons(ETH_P_8021Q)) br->group_fwd_mask_required = BR_GROUPFWD_DEFAULT; else /* vlan_enabled && ETH_P_8021AD */ br->group_fwd_mask_required = BR_GROUPFWD_8021AD & ~(1u << br->group_addr[5]); } int br_vlan_filter_toggle(struct net_bridge *br, unsigned long val, struct netlink_ext_ack *extack) { struct switchdev_attr attr = { .orig_dev = br->dev, .id = SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING, .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP, .u.vlan_filtering = val, }; int err; if (br_opt_get(br, BROPT_VLAN_ENABLED) == !!val) return 0; br_opt_toggle(br, BROPT_VLAN_ENABLED, !!val); err = switchdev_port_attr_set(br->dev, &attr, extack); if (err && err != -EOPNOTSUPP) { br_opt_toggle(br, BROPT_VLAN_ENABLED, !val); return err; } br_manage_promisc(br); recalculate_group_addr(br); br_recalculate_fwd_mask(br); if (!val && br_opt_get(br, BROPT_MCAST_VLAN_SNOOPING_ENABLED)) { br_info(br, "vlan filtering disabled, automatically disabling multicast vlan snooping\n"); br_multicast_toggle_vlan_snooping(br, false, NULL); } return 0; } bool br_vlan_enabled(const struct net_device *dev) { struct net_bridge *br = netdev_priv(dev); return br_opt_get(br, BROPT_VLAN_ENABLED); } EXPORT_SYMBOL_GPL(br_vlan_enabled); int br_vlan_get_proto(const struct net_device *dev, u16 *p_proto) { struct net_bridge *br = netdev_priv(dev); *p_proto = ntohs(br->vlan_proto); return 0; } EXPORT_SYMBOL_GPL(br_vlan_get_proto); int __br_vlan_set_proto(struct net_bridge *br, __be16 proto, struct netlink_ext_ack *extack) { struct switchdev_attr attr = { .orig_dev = br->dev, .id = SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL, .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP, .u.vlan_protocol = ntohs(proto), }; int err = 0; struct net_bridge_port *p; struct net_bridge_vlan *vlan; struct net_bridge_vlan_group *vg; __be16 oldproto = br->vlan_proto; if (br->vlan_proto == proto) return 0; err = switchdev_port_attr_set(br->dev, &attr, extack); if (err && err != -EOPNOTSUPP) return err; /* Add VLANs for the new proto to the device filter. */ list_for_each_entry(p, &br->port_list, list) { vg = nbp_vlan_group(p); list_for_each_entry(vlan, &vg->vlan_list, vlist) { if (vlan->priv_flags & BR_VLFLAG_ADDED_BY_SWITCHDEV) continue; err = vlan_vid_add(p->dev, proto, vlan->vid); if (err) goto err_filt; } } br->vlan_proto = proto; recalculate_group_addr(br); br_recalculate_fwd_mask(br); /* Delete VLANs for the old proto from the device filter. */ list_for_each_entry(p, &br->port_list, list) { vg = nbp_vlan_group(p); list_for_each_entry(vlan, &vg->vlan_list, vlist) { if (vlan->priv_flags & BR_VLFLAG_ADDED_BY_SWITCHDEV) continue; vlan_vid_del(p->dev, oldproto, vlan->vid); } } return 0; err_filt: attr.u.vlan_protocol = ntohs(oldproto); switchdev_port_attr_set(br->dev, &attr, NULL); list_for_each_entry_continue_reverse(vlan, &vg->vlan_list, vlist) { if (vlan->priv_flags & BR_VLFLAG_ADDED_BY_SWITCHDEV) continue; vlan_vid_del(p->dev, proto, vlan->vid); } list_for_each_entry_continue_reverse(p, &br->port_list, list) { vg = nbp_vlan_group(p); list_for_each_entry(vlan, &vg->vlan_list, vlist) { if (vlan->priv_flags & BR_VLFLAG_ADDED_BY_SWITCHDEV) continue; vlan_vid_del(p->dev, proto, vlan->vid); } } return err; } int br_vlan_set_proto(struct net_bridge *br, unsigned long val, struct netlink_ext_ack *extack) { if (!eth_type_vlan(htons(val))) return -EPROTONOSUPPORT; return __br_vlan_set_proto(br, htons(val), extack); } int br_vlan_set_stats(struct net_bridge *br, unsigned long val) { switch (val) { case 0: case 1: br_opt_toggle(br, BROPT_VLAN_STATS_ENABLED, !!val); break; default: return -EINVAL; } return 0; } int br_vlan_set_stats_per_port(struct net_bridge *br, unsigned long val) { struct net_bridge_port *p; /* allow to change the option if there are no port vlans configured */ list_for_each_entry(p, &br->port_list, list) { struct net_bridge_vlan_group *vg = nbp_vlan_group(p); if (vg->num_vlans) return -EBUSY; } switch (val) { case 0: case 1: br_opt_toggle(br, BROPT_VLAN_STATS_PER_PORT, !!val); break; default: return -EINVAL; } return 0; } static bool vlan_default_pvid(struct net_bridge_vlan_group *vg, u16 vid) { struct net_bridge_vlan *v; if (vid != vg->pvid) return false; v = br_vlan_lookup(&vg->vlan_hash, vid); if (v && br_vlan_should_use(v) && (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)) return true; return false; } static void br_vlan_disable_default_pvid(struct net_bridge *br) { struct net_bridge_port *p; u16 pvid = br->default_pvid; /* Disable default_pvid on all ports where it is still * configured. */ if (vlan_default_pvid(br_vlan_group(br), pvid)) { if (!br_vlan_delete(br, pvid)) br_vlan_notify(br, NULL, pvid, 0, RTM_DELVLAN); } list_for_each_entry(p, &br->port_list, list) { if (vlan_default_pvid(nbp_vlan_group(p), pvid) && !nbp_vlan_delete(p, pvid)) br_vlan_notify(br, p, pvid, 0, RTM_DELVLAN); } br->default_pvid = 0; } int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid, struct netlink_ext_ack *extack) { const struct net_bridge_vlan *pvent; struct net_bridge_vlan_group *vg; struct net_bridge_port *p; unsigned long *changed; bool vlchange; u16 old_pvid; int err = 0; if (!pvid) { br_vlan_disable_default_pvid(br); return 0; } changed = bitmap_zalloc(BR_MAX_PORTS, GFP_KERNEL); if (!changed) return -ENOMEM; old_pvid = br->default_pvid; /* Update default_pvid config only if we do not conflict with * user configuration. */ vg = br_vlan_group(br); pvent = br_vlan_find(vg, pvid); if ((!old_pvid || vlan_default_pvid(vg, old_pvid)) && (!pvent || !br_vlan_should_use(pvent))) { err = br_vlan_add(br, pvid, BRIDGE_VLAN_INFO_PVID | BRIDGE_VLAN_INFO_UNTAGGED | BRIDGE_VLAN_INFO_BRENTRY, &vlchange, extack); if (err) goto out; if (br_vlan_delete(br, old_pvid)) br_vlan_notify(br, NULL, old_pvid, 0, RTM_DELVLAN); br_vlan_notify(br, NULL, pvid, 0, RTM_NEWVLAN); __set_bit(0, changed); } list_for_each_entry(p, &br->port_list, list) { /* Update default_pvid config only if we do not conflict with * user configuration. */ vg = nbp_vlan_group(p); if ((old_pvid && !vlan_default_pvid(vg, old_pvid)) || br_vlan_find(vg, pvid)) continue; err = nbp_vlan_add(p, pvid, BRIDGE_VLAN_INFO_PVID | BRIDGE_VLAN_INFO_UNTAGGED, &vlchange, extack); if (err) goto err_port; if (nbp_vlan_delete(p, old_pvid)) br_vlan_notify(br, p, old_pvid, 0, RTM_DELVLAN); br_vlan_notify(p->br, p, pvid, 0, RTM_NEWVLAN); __set_bit(p->port_no, changed); } br->default_pvid = pvid; out: bitmap_free(changed); return err; err_port: list_for_each_entry_continue_reverse(p, &br->port_list, list) { if (!test_bit(p->port_no, changed)) continue; if (old_pvid) { nbp_vlan_add(p, old_pvid, BRIDGE_VLAN_INFO_PVID | BRIDGE_VLAN_INFO_UNTAGGED, &vlchange, NULL); br_vlan_notify(p->br, p, old_pvid, 0, RTM_NEWVLAN); } nbp_vlan_delete(p, pvid); br_vlan_notify(br, p, pvid, 0, RTM_DELVLAN); } if (test_bit(0, changed)) { if (old_pvid) { br_vlan_add(br, old_pvid, BRIDGE_VLAN_INFO_PVID | BRIDGE_VLAN_INFO_UNTAGGED | BRIDGE_VLAN_INFO_BRENTRY, &vlchange, NULL); br_vlan_notify(br, NULL, old_pvid, 0, RTM_NEWVLAN); } br_vlan_delete(br, pvid); br_vlan_notify(br, NULL, pvid, 0, RTM_DELVLAN); } goto out; } int br_vlan_set_default_pvid(struct net_bridge *br, unsigned long val, struct netlink_ext_ack *extack) { u16 pvid = val; int err = 0; if (val >= VLAN_VID_MASK) return -EINVAL; if (pvid == br->default_pvid) goto out; /* Only allow default pvid change when filtering is disabled */ if (br_opt_get(br, BROPT_VLAN_ENABLED)) { pr_info_once("Please disable vlan filtering to change default_pvid\n"); err = -EPERM; goto out; } err = __br_vlan_set_default_pvid(br, pvid, extack); out: return err; } int br_vlan_init(struct net_bridge *br) { struct net_bridge_vlan_group *vg; int ret = -ENOMEM; vg = kzalloc(sizeof(*vg), GFP_KERNEL); if (!vg) goto out; ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params); if (ret) goto err_rhtbl; ret = vlan_tunnel_init(vg); if (ret) goto err_tunnel_init; INIT_LIST_HEAD(&vg->vlan_list); br->vlan_proto = htons(ETH_P_8021Q); br->default_pvid = 1; rcu_assign_pointer(br->vlgrp, vg); out: return ret; err_tunnel_init: rhashtable_destroy(&vg->vlan_hash); err_rhtbl: kfree(vg); goto out; } int nbp_vlan_init(struct net_bridge_port *p, struct netlink_ext_ack *extack) { struct switchdev_attr attr = { .orig_dev = p->br->dev, .id = SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING, .flags = SWITCHDEV_F_SKIP_EOPNOTSUPP, .u.vlan_filtering = br_opt_get(p->br, BROPT_VLAN_ENABLED), }; struct net_bridge_vlan_group *vg; int ret = -ENOMEM; vg = kzalloc(sizeof(struct net_bridge_vlan_group), GFP_KERNEL); if (!vg) goto out; ret = switchdev_port_attr_set(p->dev, &attr, extack); if (ret && ret != -EOPNOTSUPP) goto err_vlan_enabled; ret = rhashtable_init(&vg->vlan_hash, &br_vlan_rht_params); if (ret) goto err_rhtbl; ret = vlan_tunnel_init(vg); if (ret) goto err_tunnel_init; INIT_LIST_HEAD(&vg->vlan_list); rcu_assign_pointer(p->vlgrp, vg); if (p->br->default_pvid) { bool changed; ret = nbp_vlan_add(p, p->br->default_pvid, BRIDGE_VLAN_INFO_PVID | BRIDGE_VLAN_INFO_UNTAGGED, &changed, extack); if (ret) goto err_vlan_add; br_vlan_notify(p->br, p, p->br->default_pvid, 0, RTM_NEWVLAN); } out: return ret; err_vlan_add: RCU_INIT_POINTER(p->vlgrp, NULL); synchronize_rcu(); vlan_tunnel_deinit(vg); err_tunnel_init: rhashtable_destroy(&vg->vlan_hash); err_rhtbl: err_vlan_enabled: kfree(vg); goto out; } /* Must be protected by RTNL. * Must be called with vid in range from 1 to 4094 inclusive. * changed must be true only if the vlan was created or updated */ int nbp_vlan_add(struct net_bridge_port *port, u16 vid, u16 flags, bool *changed, struct netlink_ext_ack *extack) { struct net_bridge_vlan *vlan; int ret; ASSERT_RTNL(); *changed = false; vlan = br_vlan_find(nbp_vlan_group(port), vid); if (vlan) { bool would_change = __vlan_flags_would_change(vlan, flags); if (would_change) { /* Pass the flags to the hardware bridge */ ret = br_switchdev_port_vlan_add(port->dev, vid, flags, true, extack); if (ret && ret != -EOPNOTSUPP) return ret; } __vlan_flags_commit(vlan, flags); *changed = would_change; return 0; } vlan = kzalloc(sizeof(*vlan), GFP_KERNEL); if (!vlan) return -ENOMEM; vlan->vid = vid; vlan->port = port; ret = __vlan_add(vlan, flags, extack); if (ret) kfree(vlan); else *changed = true; return ret; } /* Must be protected by RTNL. * Must be called with vid in range from 1 to 4094 inclusive. */ int nbp_vlan_delete(struct net_bridge_port *port, u16 vid) { struct net_bridge_vlan *v; ASSERT_RTNL(); v = br_vlan_find(nbp_vlan_group(port), vid); if (!v) return -ENOENT; br_fdb_find_delete_local(port->br, port, port->dev->dev_addr, vid); br_fdb_delete_by_port(port->br, port, vid, 0); return __vlan_del(v); } void nbp_vlan_flush(struct net_bridge_port *port) { struct net_bridge_vlan_group *vg; ASSERT_RTNL(); vg = nbp_vlan_group(port); __vlan_flush(port->br, port, vg); RCU_INIT_POINTER(port->vlgrp, NULL); synchronize_net(); __vlan_group_free(vg); } void br_vlan_get_stats(const struct net_bridge_vlan *v, struct pcpu_sw_netstats *stats) { int i; memset(stats, 0, sizeof(*stats)); for_each_possible_cpu(i) { u64 rxpackets, rxbytes, txpackets, txbytes; struct pcpu_sw_netstats *cpu_stats; unsigned int start; cpu_stats = per_cpu_ptr(v->stats, i); do { start = u64_stats_fetch_begin(&cpu_stats->syncp); rxpackets = u64_stats_read(&cpu_stats->rx_packets); rxbytes = u64_stats_read(&cpu_stats->rx_bytes); txbytes = u64_stats_read(&cpu_stats->tx_bytes); txpackets = u64_stats_read(&cpu_stats->tx_packets); } while (u64_stats_fetch_retry(&cpu_stats->syncp, start)); u64_stats_add(&stats->rx_packets, rxpackets); u64_stats_add(&stats->rx_bytes, rxbytes); u64_stats_add(&stats->tx_bytes, txbytes); u64_stats_add(&stats->tx_packets, txpackets); } } int br_vlan_get_pvid(const struct net_device *dev, u16 *p_pvid) { struct net_bridge_vlan_group *vg; struct net_bridge_port *p; ASSERT_RTNL(); p = br_port_get_check_rtnl(dev); if (p) vg = nbp_vlan_group(p); else if (netif_is_bridge_master(dev)) vg = br_vlan_group(netdev_priv(dev)); else return -EINVAL; *p_pvid = br_get_pvid(vg); return 0; } EXPORT_SYMBOL_GPL(br_vlan_get_pvid); int br_vlan_get_pvid_rcu(const struct net_device *dev, u16 *p_pvid) { struct net_bridge_vlan_group *vg; struct net_bridge_port *p; p = br_port_get_check_rcu(dev); if (p) vg = nbp_vlan_group_rcu(p); else if (netif_is_bridge_master(dev)) vg = br_vlan_group_rcu(netdev_priv(dev)); else return -EINVAL; *p_pvid = br_get_pvid(vg); return 0; } EXPORT_SYMBOL_GPL(br_vlan_get_pvid_rcu); void br_vlan_fill_forward_path_pvid(struct net_bridge *br, struct net_device_path_ctx *ctx, struct net_device_path *path) { struct net_bridge_vlan_group *vg; int idx = ctx->num_vlans - 1; u16 vid; path->bridge.vlan_mode = DEV_PATH_BR_VLAN_KEEP; if (!br_opt_get(br, BROPT_VLAN_ENABLED)) return; vg = br_vlan_group_rcu(br); if (idx >= 0 && ctx->vlan[idx].proto == br->vlan_proto) { vid = ctx->vlan[idx].id; } else { path->bridge.vlan_mode = DEV_PATH_BR_VLAN_TAG; vid = br_get_pvid(vg); } path->bridge.vlan_id = vid; path->bridge.vlan_proto = br->vlan_proto; } int br_vlan_fill_forward_path_mode(struct net_bridge *br, struct net_bridge_port *dst, struct net_device_path *path) { struct net_bridge_vlan_group *vg; struct net_bridge_vlan *v; if (!br_opt_get(br, BROPT_VLAN_ENABLED)) return 0; vg = nbp_vlan_group_rcu(dst); v = br_vlan_find(vg, path->bridge.vlan_id); if (!v || !br_vlan_should_use(v)) return -EINVAL; if (!(v->flags & BRIDGE_VLAN_INFO_UNTAGGED)) return 0; if (path->bridge.vlan_mode == DEV_PATH_BR_VLAN_TAG) path->bridge.vlan_mode = DEV_PATH_BR_VLAN_KEEP; else if (v->priv_flags & BR_VLFLAG_ADDED_BY_SWITCHDEV) path->bridge.vlan_mode = DEV_PATH_BR_VLAN_UNTAG_HW; else path->bridge.vlan_mode = DEV_PATH_BR_VLAN_UNTAG; return 0; } int br_vlan_get_info(const struct net_device *dev, u16 vid, struct bridge_vlan_info *p_vinfo) { struct net_bridge_vlan_group *vg; struct net_bridge_vlan *v; struct net_bridge_port *p; ASSERT_RTNL(); p = br_port_get_check_rtnl(dev); if (p) vg = nbp_vlan_group(p); else if (netif_is_bridge_master(dev)) vg = br_vlan_group(netdev_priv(dev)); else return -EINVAL; v = br_vlan_find(vg, vid); if (!v) return -ENOENT; p_vinfo->vid = vid; p_vinfo->flags = v->flags; if (vid == br_get_pvid(vg)) p_vinfo->flags |= BRIDGE_VLAN_INFO_PVID; return 0; } EXPORT_SYMBOL_GPL(br_vlan_get_info); int br_vlan_get_info_rcu(const struct net_device *dev, u16 vid, struct bridge_vlan_info *p_vinfo) { struct net_bridge_vlan_group *vg; struct net_bridge_vlan *v; struct net_bridge_port *p; p = br_port_get_check_rcu(dev); if (p) vg = nbp_vlan_group_rcu(p); else if (netif_is_bridge_master(dev)) vg = br_vlan_group_rcu(netdev_priv(dev)); else return -EINVAL; v = br_vlan_find(vg, vid); if (!v) return -ENOENT; p_vinfo->vid = vid; p_vinfo->flags = v->flags; if (vid == br_get_pvid(vg)) p_vinfo->flags |= BRIDGE_VLAN_INFO_PVID; return 0; } EXPORT_SYMBOL_GPL(br_vlan_get_info_rcu); static int br_vlan_is_bind_vlan_dev(const struct net_device *dev) { return is_vlan_dev(dev) && !!(vlan_dev_priv(dev)->flags & VLAN_FLAG_BRIDGE_BINDING); } static int br_vlan_is_bind_vlan_dev_fn(struct net_device *dev, __always_unused struct netdev_nested_priv *priv) { return br_vlan_is_bind_vlan_dev(dev); } static bool br_vlan_has_upper_bind_vlan_dev(struct net_device *dev) { int found; rcu_read_lock(); found = netdev_walk_all_upper_dev_rcu(dev, br_vlan_is_bind_vlan_dev_fn, NULL); rcu_read_unlock(); return !!found; } struct br_vlan_bind_walk_data { u16 vid; struct net_device *result; }; static int br_vlan_match_bind_vlan_dev_fn(struct net_device *dev, struct netdev_nested_priv *priv) { struct br_vlan_bind_walk_data *data = priv->data; int found = 0; if (br_vlan_is_bind_vlan_dev(dev) && vlan_dev_priv(dev)->vlan_id == data->vid) { data->result = dev; found = 1; } return found; } static struct net_device * br_vlan_get_upper_bind_vlan_dev(struct net_device *dev, u16 vid) { struct br_vlan_bind_walk_data data = { .vid = vid, }; struct netdev_nested_priv priv = { .data = (void *)&data, }; rcu_read_lock(); netdev_walk_all_upper_dev_rcu(dev, br_vlan_match_bind_vlan_dev_fn, &priv); rcu_read_unlock(); return data.result; } static bool br_vlan_is_dev_up(const struct net_device *dev) { return !!(dev->flags & IFF_UP) && netif_oper_up(dev); } static void br_vlan_set_vlan_dev_state(const struct net_bridge *br, struct net_device *vlan_dev) { u16 vid = vlan_dev_priv(vlan_dev)->vlan_id; struct net_bridge_vlan_group *vg; struct net_bridge_port *p; bool has_carrier = false; if (!netif_carrier_ok(br->dev)) { netif_carrier_off(vlan_dev); return; } list_for_each_entry(p, &br->port_list, list) { vg = nbp_vlan_group(p); if (br_vlan_find(vg, vid) && br_vlan_is_dev_up(p->dev)) { has_carrier = true; break; } } if (has_carrier) netif_carrier_on(vlan_dev); else netif_carrier_off(vlan_dev); } static void br_vlan_set_all_vlan_dev_state(struct net_bridge_port *p) { struct net_bridge_vlan_group *vg = nbp_vlan_group(p); struct net_bridge_vlan *vlan; struct net_device *vlan_dev; list_for_each_entry(vlan, &vg->vlan_list, vlist) { vlan_dev = br_vlan_get_upper_bind_vlan_dev(p->br->dev, vlan->vid); if (vlan_dev) { if (br_vlan_is_dev_up(p->dev)) { if (netif_carrier_ok(p->br->dev)) netif_carrier_on(vlan_dev); } else { br_vlan_set_vlan_dev_state(p->br, vlan_dev); } } } } static void br_vlan_toggle_bridge_binding(struct net_device *br_dev, bool enable) { struct net_bridge *br = netdev_priv(br_dev); if (enable) br_opt_toggle(br, BROPT_VLAN_BRIDGE_BINDING, true); else br_opt_toggle(br, BROPT_VLAN_BRIDGE_BINDING, br_vlan_has_upper_bind_vlan_dev(br_dev)); } static void br_vlan_upper_change(struct net_device *dev, struct net_device *upper_dev, bool linking) { struct net_bridge *br = netdev_priv(dev); if (!br_vlan_is_bind_vlan_dev(upper_dev)) return; br_vlan_toggle_bridge_binding(dev, linking); if (linking) br_vlan_set_vlan_dev_state(br, upper_dev); } struct br_vlan_link_state_walk_data { struct net_bridge *br; }; static int br_vlan_link_state_change_fn(struct net_device *vlan_dev, struct netdev_nested_priv *priv) { struct br_vlan_link_state_walk_data *data = priv->data; if (br_vlan_is_bind_vlan_dev(vlan_dev)) br_vlan_set_vlan_dev_state(data->br, vlan_dev); return 0; } static void br_vlan_link_state_change(struct net_device *dev, struct net_bridge *br) { struct br_vlan_link_state_walk_data data = { .br = br }; struct netdev_nested_priv priv = { .data = (void *)&data, }; rcu_read_lock(); netdev_walk_all_upper_dev_rcu(dev, br_vlan_link_state_change_fn, &priv); rcu_read_unlock(); } /* Must be protected by RTNL. */ static void nbp_vlan_set_vlan_dev_state(struct net_bridge_port *p, u16 vid) { struct net_device *vlan_dev; if (!br_opt_get(p->br, BROPT_VLAN_BRIDGE_BINDING)) return; vlan_dev = br_vlan_get_upper_bind_vlan_dev(p->br->dev, vid); if (vlan_dev) br_vlan_set_vlan_dev_state(p->br, vlan_dev); } /* Must be protected by RTNL. */ int br_vlan_bridge_event(struct net_device *dev, unsigned long event, void *ptr) { struct netdev_notifier_changeupper_info *info; struct net_bridge *br = netdev_priv(dev); int vlcmd = 0, ret = 0; bool changed = false; switch (event) { case NETDEV_REGISTER: ret = br_vlan_add(br, br->default_pvid, BRIDGE_VLAN_INFO_PVID | BRIDGE_VLAN_INFO_UNTAGGED | BRIDGE_VLAN_INFO_BRENTRY, &changed, NULL); vlcmd = RTM_NEWVLAN; break; case NETDEV_UNREGISTER: changed = !br_vlan_delete(br, br->default_pvid); vlcmd = RTM_DELVLAN; break; case NETDEV_CHANGEUPPER: info = ptr; br_vlan_upper_change(dev, info->upper_dev, info->linking); break; case NETDEV_CHANGE: case NETDEV_UP: if (!br_opt_get(br, BROPT_VLAN_BRIDGE_BINDING)) break; br_vlan_link_state_change(dev, br); break; } if (changed) br_vlan_notify(br, NULL, br->default_pvid, 0, vlcmd); return ret; } void br_vlan_vlan_upper_event(struct net_device *br_dev, struct net_device *vlan_dev, unsigned long event) { struct vlan_dev_priv *vlan = vlan_dev_priv(vlan_dev); struct net_bridge *br = netdev_priv(br_dev); bool bridge_binding; switch (event) { case NETDEV_CHANGE: case NETDEV_UP: break; default: return; } bridge_binding = vlan->flags & VLAN_FLAG_BRIDGE_BINDING; br_vlan_toggle_bridge_binding(br_dev, bridge_binding); if (bridge_binding) br_vlan_set_vlan_dev_state(br, vlan_dev); else if (!bridge_binding && netif_carrier_ok(br_dev)) netif_carrier_on(vlan_dev); } /* Must be protected by RTNL. */ void br_vlan_port_event(struct net_bridge_port *p, unsigned long event) { if (!br_opt_get(p->br, BROPT_VLAN_BRIDGE_BINDING)) return; switch (event) { case NETDEV_CHANGE: case NETDEV_DOWN: case NETDEV_UP: br_vlan_set_all_vlan_dev_state(p); break; } } static bool br_vlan_stats_fill(struct sk_buff *skb, const struct net_bridge_vlan *v) { struct pcpu_sw_netstats stats; struct nlattr *nest; nest = nla_nest_start(skb, BRIDGE_VLANDB_ENTRY_STATS); if (!nest) return false; br_vlan_get_stats(v, &stats); if (nla_put_u64_64bit(skb, BRIDGE_VLANDB_STATS_RX_BYTES, u64_stats_read(&stats.rx_bytes), BRIDGE_VLANDB_STATS_PAD) || nla_put_u64_64bit(skb, BRIDGE_VLANDB_STATS_RX_PACKETS, u64_stats_read(&stats.rx_packets), BRIDGE_VLANDB_STATS_PAD) || nla_put_u64_64bit(skb, BRIDGE_VLANDB_STATS_TX_BYTES, u64_stats_read(&stats.tx_bytes), BRIDGE_VLANDB_STATS_PAD) || nla_put_u64_64bit(skb, BRIDGE_VLANDB_STATS_TX_PACKETS, u64_stats_read(&stats.tx_packets), BRIDGE_VLANDB_STATS_PAD)) goto out_err; nla_nest_end(skb, nest); return true; out_err: nla_nest_cancel(skb, nest); return false; } /* v_opts is used to dump the options which must be equal in the whole range */ static bool br_vlan_fill_vids(struct sk_buff *skb, u16 vid, u16 vid_range, const struct net_bridge_vlan *v_opts, const struct net_bridge_port *p, u16 flags, bool dump_stats) { struct bridge_vlan_info info; struct nlattr *nest; nest = nla_nest_start(skb, BRIDGE_VLANDB_ENTRY); if (!nest) return false; memset(&info, 0, sizeof(info)); info.vid = vid; if (flags & BRIDGE_VLAN_INFO_UNTAGGED) info.flags |= BRIDGE_VLAN_INFO_UNTAGGED; if (flags & BRIDGE_VLAN_INFO_PVID) info.flags |= BRIDGE_VLAN_INFO_PVID; if (nla_put(skb, BRIDGE_VLANDB_ENTRY_INFO, sizeof(info), &info)) goto out_err; if (vid_range && vid < vid_range && !(flags & BRIDGE_VLAN_INFO_PVID) && nla_put_u16(skb, BRIDGE_VLANDB_ENTRY_RANGE, vid_range)) goto out_err; if (v_opts) { if (!br_vlan_opts_fill(skb, v_opts, p)) goto out_err; if (dump_stats && !br_vlan_stats_fill(skb, v_opts)) goto out_err; } nla_nest_end(skb, nest); return true; out_err: nla_nest_cancel(skb, nest); return false; } static size_t rtnl_vlan_nlmsg_size(void) { return NLMSG_ALIGN(sizeof(struct br_vlan_msg)) + nla_total_size(0) /* BRIDGE_VLANDB_ENTRY */ + nla_total_size(sizeof(u16)) /* BRIDGE_VLANDB_ENTRY_RANGE */ + nla_total_size(sizeof(struct bridge_vlan_info)) /* BRIDGE_VLANDB_ENTRY_INFO */ + br_vlan_opts_nl_size(); /* bridge vlan options */ } void br_vlan_notify(const struct net_bridge *br, const struct net_bridge_port *p, u16 vid, u16 vid_range, int cmd) { struct net_bridge_vlan_group *vg; struct net_bridge_vlan *v = NULL; struct br_vlan_msg *bvm; struct nlmsghdr *nlh; struct sk_buff *skb; int err = -ENOBUFS; struct net *net; u16 flags = 0; int ifindex; /* right now notifications are done only with rtnl held */ ASSERT_RTNL(); if (p) { ifindex = p->dev->ifindex; vg = nbp_vlan_group(p); net = dev_net(p->dev); } else { ifindex = br->dev->ifindex; vg = br_vlan_group(br); net = dev_net(br->dev); } skb = nlmsg_new(rtnl_vlan_nlmsg_size(), GFP_KERNEL); if (!skb) goto out_err; err = -EMSGSIZE; nlh = nlmsg_put(skb, 0, 0, cmd, sizeof(*bvm), 0); if (!nlh) goto out_err; bvm = nlmsg_data(nlh); memset(bvm, 0, sizeof(*bvm)); bvm->family = AF_BRIDGE; bvm->ifindex = ifindex; switch (cmd) { case RTM_NEWVLAN: /* need to find the vlan due to flags/options */ v = br_vlan_find(vg, vid); if (!v || !br_vlan_should_use(v)) goto out_kfree; flags = v->flags; if (br_get_pvid(vg) == v->vid) flags |= BRIDGE_VLAN_INFO_PVID; break; case RTM_DELVLAN: break; default: goto out_kfree; } if (!br_vlan_fill_vids(skb, vid, vid_range, v, p, flags, false)) goto out_err; nlmsg_end(skb, nlh); rtnl_notify(skb, net, 0, RTNLGRP_BRVLAN, NULL, GFP_KERNEL); return; out_err: rtnl_set_sk_err(net, RTNLGRP_BRVLAN, err); out_kfree: kfree_skb(skb); } /* check if v_curr can enter a range ending in range_end */ bool br_vlan_can_enter_range(const struct net_bridge_vlan *v_curr, const struct net_bridge_vlan *range_end) { return v_curr->vid - range_end->vid == 1 && range_end->flags == v_curr->flags && br_vlan_opts_eq_range(v_curr, range_end); } static int br_vlan_dump_dev(const struct net_device *dev, struct sk_buff *skb, struct netlink_callback *cb, u32 dump_flags) { struct net_bridge_vlan *v, *range_start = NULL, *range_end = NULL; bool dump_global = !!(dump_flags & BRIDGE_VLANDB_DUMPF_GLOBAL); bool dump_stats = !!(dump_flags & BRIDGE_VLANDB_DUMPF_STATS); struct net_bridge_vlan_group *vg; int idx = 0, s_idx = cb->args[1]; struct nlmsghdr *nlh = NULL; struct net_bridge_port *p; struct br_vlan_msg *bvm; struct net_bridge *br; int err = 0; u16 pvid; if (!netif_is_bridge_master(dev) && !netif_is_bridge_port(dev)) return -EINVAL; if (netif_is_bridge_master(dev)) { br = netdev_priv(dev); vg = br_vlan_group_rcu(br); p = NULL; } else { /* global options are dumped only for bridge devices */ if (dump_global) return 0; p = br_port_get_rcu(dev); if (WARN_ON(!p)) return -EINVAL; vg = nbp_vlan_group_rcu(p); br = p->br; } if (!vg) return 0; nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWVLAN, sizeof(*bvm), NLM_F_MULTI); if (!nlh) return -EMSGSIZE; bvm = nlmsg_data(nlh); memset(bvm, 0, sizeof(*bvm)); bvm->family = PF_BRIDGE; bvm->ifindex = dev->ifindex; pvid = br_get_pvid(vg); /* idx must stay at range's beginning until it is filled in */ list_for_each_entry_rcu(v, &vg->vlan_list, vlist) { if (!dump_global && !br_vlan_should_use(v)) continue; if (idx < s_idx) { idx++; continue; } if (!range_start) { range_start = v; range_end = v; continue; } if (dump_global) { if (br_vlan_global_opts_can_enter_range(v, range_end)) goto update_end; if (!br_vlan_global_opts_fill(skb, range_start->vid, range_end->vid, range_start)) { err = -EMSGSIZE; break; } /* advance number of filled vlans */ idx += range_end->vid - range_start->vid + 1; range_start = v; } else if (dump_stats || v->vid == pvid || !br_vlan_can_enter_range(v, range_end)) { u16 vlan_flags = br_vlan_flags(range_start, pvid); if (!br_vlan_fill_vids(skb, range_start->vid, range_end->vid, range_start, p, vlan_flags, dump_stats)) { err = -EMSGSIZE; break; } /* advance number of filled vlans */ idx += range_end->vid - range_start->vid + 1; range_start = v; } update_end: range_end = v; } /* err will be 0 and range_start will be set in 3 cases here: * - first vlan (range_start == range_end) * - last vlan (range_start == range_end, not in range) * - last vlan range (range_start != range_end, in range) */ if (!err && range_start) { if (dump_global && !br_vlan_global_opts_fill(skb, range_start->vid, range_end->vid, range_start)) err = -EMSGSIZE; else if (!dump_global && !br_vlan_fill_vids(skb, range_start->vid, range_end->vid, range_start, p, br_vlan_flags(range_start, pvid), dump_stats)) err = -EMSGSIZE; } cb->args[1] = err ? idx : 0; nlmsg_end(skb, nlh); return err; } static const struct nla_policy br_vlan_db_dump_pol[BRIDGE_VLANDB_DUMP_MAX + 1] = { [BRIDGE_VLANDB_DUMP_FLAGS] = { .type = NLA_U32 }, }; static int br_vlan_rtm_dump(struct sk_buff *skb, struct netlink_callback *cb) { struct nlattr *dtb[BRIDGE_VLANDB_DUMP_MAX + 1]; int idx = 0, err = 0, s_idx = cb->args[0]; struct net *net = sock_net(skb->sk); struct br_vlan_msg *bvm; struct net_device *dev; u32 dump_flags = 0; err = nlmsg_parse(cb->nlh, sizeof(*bvm), dtb, BRIDGE_VLANDB_DUMP_MAX, br_vlan_db_dump_pol, cb->extack); if (err < 0) return err; bvm = nlmsg_data(cb->nlh); if (dtb[BRIDGE_VLANDB_DUMP_FLAGS]) dump_flags = nla_get_u32(dtb[BRIDGE_VLANDB_DUMP_FLAGS]); rcu_read_lock(); if (bvm->ifindex) { dev = dev_get_by_index_rcu(net, bvm->ifindex); if (!dev) { err = -ENODEV; goto out_err; } err = br_vlan_dump_dev(dev, skb, cb, dump_flags); /* if the dump completed without an error we return 0 here */ if (err != -EMSGSIZE) goto out_err; } else { for_each_netdev_rcu(net, dev) { if (idx < s_idx) goto skip; err = br_vlan_dump_dev(dev, skb, cb, dump_flags); if (err == -EMSGSIZE) break; skip: idx++; } } cb->args[0] = idx; rcu_read_unlock(); return skb->len; out_err: rcu_read_unlock(); return err; } static const struct nla_policy br_vlan_db_policy[BRIDGE_VLANDB_ENTRY_MAX + 1] = { [BRIDGE_VLANDB_ENTRY_INFO] = NLA_POLICY_EXACT_LEN(sizeof(struct bridge_vlan_info)), [BRIDGE_VLANDB_ENTRY_RANGE] = { .type = NLA_U16 }, [BRIDGE_VLANDB_ENTRY_STATE] = { .type = NLA_U8 }, [BRIDGE_VLANDB_ENTRY_TUNNEL_INFO] = { .type = NLA_NESTED }, [BRIDGE_VLANDB_ENTRY_MCAST_ROUTER] = { .type = NLA_U8 }, [BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS] = { .type = NLA_REJECT }, [BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS] = { .type = NLA_U32 }, [BRIDGE_VLANDB_ENTRY_NEIGH_SUPPRESS] = NLA_POLICY_MAX(NLA_U8, 1), }; static int br_vlan_rtm_process_one(struct net_device *dev, const struct nlattr *attr, int cmd, struct netlink_ext_ack *extack) { struct bridge_vlan_info *vinfo, vrange_end, *vinfo_last = NULL; struct nlattr *tb[BRIDGE_VLANDB_ENTRY_MAX + 1]; bool changed = false, skip_processing = false; struct net_bridge_vlan_group *vg; struct net_bridge_port *p = NULL; int err = 0, cmdmap = 0; struct net_bridge *br; if (netif_is_bridge_master(dev)) { br = netdev_priv(dev); vg = br_vlan_group(br); } else { p = br_port_get_rtnl(dev); if (WARN_ON(!p)) return -ENODEV; br = p->br; vg = nbp_vlan_group(p); } if (WARN_ON(!vg)) return -ENODEV; err = nla_parse_nested(tb, BRIDGE_VLANDB_ENTRY_MAX, attr, br_vlan_db_policy, extack); if (err) return err; if (!tb[BRIDGE_VLANDB_ENTRY_INFO]) { NL_SET_ERR_MSG_MOD(extack, "Missing vlan entry info"); return -EINVAL; } memset(&vrange_end, 0, sizeof(vrange_end)); vinfo = nla_data(tb[BRIDGE_VLANDB_ENTRY_INFO]); if (vinfo->flags & (BRIDGE_VLAN_INFO_RANGE_BEGIN | BRIDGE_VLAN_INFO_RANGE_END)) { NL_SET_ERR_MSG_MOD(extack, "Old-style vlan ranges are not allowed when using RTM vlan calls"); return -EINVAL; } if (!br_vlan_valid_id(vinfo->vid, extack)) return -EINVAL; if (tb[BRIDGE_VLANDB_ENTRY_RANGE]) { vrange_end.vid = nla_get_u16(tb[BRIDGE_VLANDB_ENTRY_RANGE]); /* validate user-provided flags without RANGE_BEGIN */ vrange_end.flags = BRIDGE_VLAN_INFO_RANGE_END | vinfo->flags; vinfo->flags |= BRIDGE_VLAN_INFO_RANGE_BEGIN; /* vinfo_last is the range start, vinfo the range end */ vinfo_last = vinfo; vinfo = &vrange_end; if (!br_vlan_valid_id(vinfo->vid, extack) || !br_vlan_valid_range(vinfo, vinfo_last, extack)) return -EINVAL; } switch (cmd) { case RTM_NEWVLAN: cmdmap = RTM_SETLINK; skip_processing = !!(vinfo->flags & BRIDGE_VLAN_INFO_ONLY_OPTS); break; case RTM_DELVLAN: cmdmap = RTM_DELLINK; break; } if (!skip_processing) { struct bridge_vlan_info *tmp_last = vinfo_last; /* br_process_vlan_info may overwrite vinfo_last */ err = br_process_vlan_info(br, p, cmdmap, vinfo, &tmp_last, &changed, extack); /* notify first if anything changed */ if (changed) br_ifinfo_notify(cmdmap, br, p); if (err) return err; } /* deal with options */ if (cmd == RTM_NEWVLAN) { struct net_bridge_vlan *range_start, *range_end; if (vinfo_last) { range_start = br_vlan_find(vg, vinfo_last->vid); range_end = br_vlan_find(vg, vinfo->vid); } else { range_start = br_vlan_find(vg, vinfo->vid); range_end = range_start; } err = br_vlan_process_options(br, p, range_start, range_end, tb, extack); } return err; } static int br_vlan_rtm_process(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { struct net *net = sock_net(skb->sk); struct br_vlan_msg *bvm; struct net_device *dev; struct nlattr *attr; int err, vlans = 0; int rem; /* this should validate the header and check for remaining bytes */ err = nlmsg_parse(nlh, sizeof(*bvm), NULL, BRIDGE_VLANDB_MAX, NULL, extack); if (err < 0) return err; bvm = nlmsg_data(nlh); dev = __dev_get_by_index(net, bvm->ifindex); if (!dev) return -ENODEV; if (!netif_is_bridge_master(dev) && !netif_is_bridge_port(dev)) { NL_SET_ERR_MSG_MOD(extack, "The device is not a valid bridge or bridge port"); return -EINVAL; } nlmsg_for_each_attr(attr, nlh, sizeof(*bvm), rem) { switch (nla_type(attr)) { case BRIDGE_VLANDB_ENTRY: err = br_vlan_rtm_process_one(dev, attr, nlh->nlmsg_type, extack); break; case BRIDGE_VLANDB_GLOBAL_OPTIONS: err = br_vlan_rtm_process_global_options(dev, attr, nlh->nlmsg_type, extack); break; default: continue; } vlans++; if (err) break; } if (!vlans) { NL_SET_ERR_MSG_MOD(extack, "No vlans found to process"); err = -EINVAL; } return err; } static const struct rtnl_msg_handler br_vlan_rtnl_msg_handlers[] = { {THIS_MODULE, PF_BRIDGE, RTM_NEWVLAN, br_vlan_rtm_process, NULL, 0}, {THIS_MODULE, PF_BRIDGE, RTM_DELVLAN, br_vlan_rtm_process, NULL, 0}, {THIS_MODULE, PF_BRIDGE, RTM_GETVLAN, NULL, br_vlan_rtm_dump, 0}, }; int br_vlan_rtnl_init(void) { return rtnl_register_many(br_vlan_rtnl_msg_handlers); } void br_vlan_rtnl_uninit(void) { rtnl_unregister_many(br_vlan_rtnl_msg_handlers); } |
| 444 501 503 446 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _SCSI_SCSI_DEVICE_H #define _SCSI_SCSI_DEVICE_H #include <linux/list.h> #include <linux/spinlock.h> #include <linux/workqueue.h> #include <linux/blk-mq.h> #include <scsi/scsi.h> #include <linux/atomic.h> #include <linux/sbitmap.h> struct bsg_device; struct device; struct request_queue; struct scsi_cmnd; struct scsi_lun; struct scsi_sense_hdr; typedef __u64 __bitwise blist_flags_t; #define SCSI_SENSE_BUFFERSIZE 96 struct scsi_mode_data { __u32 length; __u16 block_descriptor_length; __u8 medium_type; __u8 device_specific; __u8 header_length; __u8 longlba:1; }; /* * sdev state: If you alter this, you also need to alter scsi_sysfs.c * (for the ascii descriptions) and the state model enforcer: * scsi_lib:scsi_device_set_state(). */ enum scsi_device_state { SDEV_CREATED = 1, /* device created but not added to sysfs * Only internal commands allowed (for inq) */ SDEV_RUNNING, /* device properly configured * All commands allowed */ SDEV_CANCEL, /* beginning to delete device * Only error handler commands allowed */ SDEV_DEL, /* device deleted * no commands allowed */ SDEV_QUIESCE, /* Device quiescent. No block commands * will be accepted, only specials (which * originate in the mid-layer) */ SDEV_OFFLINE, /* Device offlined (by error handling or * user request */ SDEV_TRANSPORT_OFFLINE, /* Offlined by transport class error handler */ SDEV_BLOCK, /* Device blocked by scsi lld. No * scsi commands from user or midlayer * should be issued to the scsi * lld. */ SDEV_CREATED_BLOCK, /* same as above but for created devices */ }; enum scsi_scan_mode { SCSI_SCAN_INITIAL = 0, SCSI_SCAN_RESCAN, SCSI_SCAN_MANUAL, }; enum scsi_device_event { SDEV_EVT_MEDIA_CHANGE = 1, /* media has changed */ SDEV_EVT_INQUIRY_CHANGE_REPORTED, /* 3F 03 UA reported */ SDEV_EVT_CAPACITY_CHANGE_REPORTED, /* 2A 09 UA reported */ SDEV_EVT_SOFT_THRESHOLD_REACHED_REPORTED, /* 38 07 UA reported */ SDEV_EVT_MODE_PARAMETER_CHANGE_REPORTED, /* 2A 01 UA reported */ SDEV_EVT_LUN_CHANGE_REPORTED, /* 3F 0E UA reported */ SDEV_EVT_ALUA_STATE_CHANGE_REPORTED, /* 2A 06 UA reported */ SDEV_EVT_POWER_ON_RESET_OCCURRED, /* 29 00 UA reported */ SDEV_EVT_FIRST = SDEV_EVT_MEDIA_CHANGE, SDEV_EVT_LAST = SDEV_EVT_POWER_ON_RESET_OCCURRED, SDEV_EVT_MAXBITS = SDEV_EVT_LAST + 1 }; struct scsi_event { enum scsi_device_event evt_type; struct list_head node; /* put union of data structures, for non-simple event types, * here */ }; /** * struct scsi_vpd - SCSI Vital Product Data * @rcu: For kfree_rcu(). * @len: Length in bytes of @data. * @data: VPD data as defined in various T10 SCSI standard documents. */ struct scsi_vpd { struct rcu_head rcu; int len; unsigned char data[]; }; struct scsi_device { struct Scsi_Host *host; struct request_queue *request_queue; /* the next two are protected by the host->host_lock */ struct list_head siblings; /* list of all devices on this host */ struct list_head same_target_siblings; /* just the devices sharing same target id */ struct sbitmap budget_map; atomic_t device_blocked; /* Device returned QUEUE_FULL. */ atomic_t restarts; spinlock_t list_lock; struct list_head starved_entry; unsigned short queue_depth; /* How deep of a queue we want */ unsigned short max_queue_depth; /* max queue depth */ unsigned short last_queue_full_depth; /* These two are used by */ unsigned short last_queue_full_count; /* scsi_track_queue_full() */ unsigned long last_queue_full_time; /* last queue full time */ unsigned long queue_ramp_up_period; /* ramp up period in jiffies */ #define SCSI_DEFAULT_RAMP_UP_PERIOD (120 * HZ) unsigned long last_queue_ramp_up; /* last queue ramp up time */ unsigned int id, channel; u64 lun; unsigned int manufacturer; /* Manufacturer of device, for using * vendor-specific cmd's */ unsigned sector_size; /* size in bytes */ void *hostdata; /* available to low-level driver */ unsigned char type; char scsi_level; char inq_periph_qual; /* PQ from INQUIRY data */ struct mutex inquiry_mutex; unsigned char inquiry_len; /* valid bytes in 'inquiry' */ unsigned char * inquiry; /* INQUIRY response data */ const char * vendor; /* [back_compat] point into 'inquiry' ... */ const char * model; /* ... after scan; point to static string */ const char * rev; /* ... "nullnullnullnull" before scan */ #define SCSI_DEFAULT_VPD_LEN 255 /* default SCSI VPD page size (max) */ struct scsi_vpd __rcu *vpd_pg0; struct scsi_vpd __rcu *vpd_pg83; struct scsi_vpd __rcu *vpd_pg80; struct scsi_vpd __rcu *vpd_pg89; struct scsi_vpd __rcu *vpd_pgb0; struct scsi_vpd __rcu *vpd_pgb1; struct scsi_vpd __rcu *vpd_pgb2; struct scsi_vpd __rcu *vpd_pgb7; struct scsi_target *sdev_target; blist_flags_t sdev_bflags; /* black/white flags as also found in * scsi_devinfo.[hc]. For now used only to * pass settings from sdev_init to scsi * core. */ unsigned int eh_timeout; /* Error handling timeout */ /* * If true, let the high-level device driver (sd) manage the device * power state for system suspend/resume (suspend to RAM and * hibernation) operations. */ unsigned manage_system_start_stop:1; /* * If true, let the high-level device driver (sd) manage the device * power state for runtime device suspand and resume operations. */ unsigned manage_runtime_start_stop:1; /* * If true, let the high-level device driver (sd) manage the device * power state for system shutdown (power off) operations. */ unsigned manage_shutdown:1; /* * If set and if the device is runtime suspended, ask the high-level * device driver (sd) to force a runtime resume of the device. */ unsigned force_runtime_start_on_system_start:1; /* * Set if the device is an ATA device. */ unsigned is_ata:1; unsigned removable:1; unsigned changed:1; /* Data invalid due to media change */ unsigned busy:1; /* Used to prevent races */ unsigned lockable:1; /* Able to prevent media removal */ unsigned locked:1; /* Media removal disabled */ unsigned borken:1; /* Tell the Seagate driver to be * painfully slow on this device */ unsigned disconnect:1; /* can disconnect */ unsigned soft_reset:1; /* Uses soft reset option */ unsigned sdtr:1; /* Device supports SDTR messages */ unsigned wdtr:1; /* Device supports WDTR messages */ unsigned ppr:1; /* Device supports PPR messages */ unsigned tagged_supported:1; /* Supports SCSI-II tagged queuing */ unsigned simple_tags:1; /* simple queue tag messages are enabled */ unsigned was_reset:1; /* There was a bus reset on the bus for * this device */ unsigned expecting_cc_ua:1; /* Expecting a CHECK_CONDITION/UNIT_ATTN * because we did a bus reset. */ unsigned use_10_for_rw:1; /* first try 10-byte read / write */ unsigned use_10_for_ms:1; /* first try 10-byte mode sense/select */ unsigned set_dbd_for_ms:1; /* Set "DBD" field in mode sense */ unsigned read_before_ms:1; /* perform a READ before MODE SENSE */ unsigned no_report_opcodes:1; /* no REPORT SUPPORTED OPERATION CODES */ unsigned no_write_same:1; /* no WRITE SAME command */ unsigned use_16_for_rw:1; /* Use read/write(16) over read/write(10) */ unsigned use_16_for_sync:1; /* Use sync (16) over sync (10) */ unsigned skip_ms_page_8:1; /* do not use MODE SENSE page 0x08 */ unsigned skip_ms_page_3f:1; /* do not use MODE SENSE page 0x3f */ unsigned skip_vpd_pages:1; /* do not read VPD pages */ unsigned try_vpd_pages:1; /* attempt to read VPD pages */ unsigned use_192_bytes_for_3f:1; /* ask for 192 bytes from page 0x3f */ unsigned no_start_on_add:1; /* do not issue start on add */ unsigned allow_restart:1; /* issue START_UNIT in error handler */ unsigned start_stop_pwr_cond:1; /* Set power cond. in START_STOP_UNIT */ unsigned no_uld_attach:1; /* disable connecting to upper level drivers */ unsigned select_no_atn:1; unsigned fix_capacity:1; /* READ_CAPACITY is too high by 1 */ unsigned guess_capacity:1; /* READ_CAPACITY might be too high by 1 */ unsigned retry_hwerror:1; /* Retry HARDWARE_ERROR */ unsigned last_sector_bug:1; /* do not use multisector accesses on SD_LAST_BUGGY_SECTORS */ unsigned no_read_disc_info:1; /* Avoid READ_DISC_INFO cmds */ unsigned no_read_capacity_16:1; /* Avoid READ_CAPACITY_16 cmds */ unsigned try_rc_10_first:1; /* Try READ_CAPACACITY_10 first */ unsigned security_supported:1; /* Supports Security Protocols */ unsigned is_visible:1; /* is the device visible in sysfs */ unsigned wce_default_on:1; /* Cache is ON by default */ unsigned no_dif:1; /* T10 PI (DIF) should be disabled */ unsigned broken_fua:1; /* Don't set FUA bit */ unsigned lun_in_cdb:1; /* Store LUN bits in CDB[1] */ unsigned unmap_limit_for_ws:1; /* Use the UNMAP limit for WRITE SAME */ unsigned rpm_autosuspend:1; /* Enable runtime autosuspend at device * creation time */ unsigned ignore_media_change:1; /* Ignore MEDIA CHANGE on resume */ unsigned silence_suspend:1; /* Do not print runtime PM related messages */ unsigned no_vpd_size:1; /* No VPD size reported in header */ unsigned cdl_supported:1; /* Command duration limits supported */ unsigned cdl_enable:1; /* Enable/disable Command duration limits */ unsigned int queue_stopped; /* request queue is quiesced */ bool offline_already; /* Device offline message logged */ unsigned int ua_new_media_ctr; /* Counter for New Media UNIT ATTENTIONs */ unsigned int ua_por_ctr; /* Counter for Power On / Reset UAs */ atomic_t disk_events_disable_depth; /* disable depth for disk events */ DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */ DECLARE_BITMAP(pending_events, SDEV_EVT_MAXBITS); /* pending events */ struct list_head event_list; /* asserted events */ struct work_struct event_work; unsigned int max_device_blocked; /* what device_blocked counts down from */ #define SCSI_DEFAULT_DEVICE_BLOCKED 3 atomic_t iorequest_cnt; atomic_t iodone_cnt; atomic_t ioerr_cnt; atomic_t iotmo_cnt; struct device sdev_gendev, sdev_dev; struct work_struct requeue_work; struct scsi_device_handler *handler; void *handler_data; size_t dma_drain_len; void *dma_drain_buf; unsigned int sg_timeout; unsigned int sg_reserved_size; struct bsg_device *bsg_dev; unsigned char access_state; struct mutex state_mutex; enum scsi_device_state sdev_state; struct task_struct *quiesced_by; unsigned long sdev_data[]; } __attribute__((aligned(sizeof(unsigned long)))); #define to_scsi_device(d) \ container_of(d, struct scsi_device, sdev_gendev) #define class_to_sdev(d) \ container_of(d, struct scsi_device, sdev_dev) #define transport_class_to_sdev(class_dev) \ to_scsi_device(class_dev->parent) #define sdev_dbg(sdev, fmt, a...) \ dev_dbg(&(sdev)->sdev_gendev, fmt, ##a) /* * like scmd_printk, but the device name is passed in * as a string pointer */ __printf(4, 5) void sdev_prefix_printk(const char *, const struct scsi_device *, const char *, const char *, ...); #define sdev_printk(l, sdev, fmt, a...) \ sdev_prefix_printk(l, sdev, NULL, fmt, ##a) __printf(3, 4) void scmd_printk(const char *, const struct scsi_cmnd *, const char *, ...); #define scmd_dbg(scmd, fmt, a...) \ do { \ struct request *__rq = scsi_cmd_to_rq((scmd)); \ \ if (__rq->q->disk) \ sdev_dbg((scmd)->device, "[%s] " fmt, \ __rq->q->disk->disk_name, ##a); \ else \ sdev_dbg((scmd)->device, fmt, ##a); \ } while (0) enum scsi_target_state { STARGET_CREATED = 1, STARGET_RUNNING, STARGET_REMOVE, STARGET_CREATED_REMOVE, STARGET_DEL, }; /* * scsi_target: representation of a scsi target, for now, this is only * used for single_lun devices. If no one has active IO to the target, * starget_sdev_user is NULL, else it points to the active sdev. */ struct scsi_target { struct scsi_device *starget_sdev_user; struct list_head siblings; struct list_head devices; struct device dev; struct kref reap_ref; /* last put renders target invisible */ unsigned int channel; unsigned int id; /* target id ... replace * scsi_device.id eventually */ unsigned int create:1; /* signal that it needs to be added */ unsigned int single_lun:1; /* Indicates we should only * allow I/O to one of the luns * for the device at a time. */ unsigned int pdt_1f_for_no_lun:1; /* PDT = 0x1f * means no lun present. */ unsigned int no_report_luns:1; /* Don't use * REPORT LUNS for scanning. */ unsigned int expecting_lun_change:1; /* A device has reported * a 3F/0E UA, other devices on * the same target will also. */ /* commands actually active on LLD. */ atomic_t target_busy; atomic_t target_blocked; /* * LLDs should set this in the sdev_init host template callout. * If set to zero then there is not limit. */ unsigned int can_queue; unsigned int max_target_blocked; #define SCSI_DEFAULT_TARGET_BLOCKED 3 char scsi_level; enum scsi_target_state state; void *hostdata; /* available to low-level driver */ unsigned long starget_data[]; /* for the transport */ /* starget_data must be the last element!!!! */ } __attribute__((aligned(sizeof(unsigned long)))); #define to_scsi_target(d) container_of(d, struct scsi_target, dev) static inline struct scsi_target *scsi_target(struct scsi_device *sdev) { return to_scsi_target(sdev->sdev_gendev.parent); } #define transport_class_to_starget(class_dev) \ to_scsi_target(class_dev->parent) #define starget_printk(prefix, starget, fmt, a...) \ dev_printk(prefix, &(starget)->dev, fmt, ##a) extern struct scsi_device *__scsi_add_device(struct Scsi_Host *, uint, uint, u64, void *hostdata); extern int scsi_add_device(struct Scsi_Host *host, uint channel, uint target, u64 lun); extern int scsi_register_device_handler(struct scsi_device_handler *scsi_dh); extern void scsi_remove_device(struct scsi_device *); extern int scsi_unregister_device_handler(struct scsi_device_handler *scsi_dh); void scsi_attach_vpd(struct scsi_device *sdev); void scsi_cdl_check(struct scsi_device *sdev); int scsi_cdl_enable(struct scsi_device *sdev, bool enable); extern struct scsi_device *scsi_device_from_queue(struct request_queue *q); extern int __must_check scsi_device_get(struct scsi_device *); extern void scsi_device_put(struct scsi_device *); extern struct scsi_device *scsi_device_lookup(struct Scsi_Host *, uint, uint, u64); extern struct scsi_device *__scsi_device_lookup(struct Scsi_Host *, uint, uint, u64); extern struct scsi_device *scsi_device_lookup_by_target(struct scsi_target *, u64); extern struct scsi_device *__scsi_device_lookup_by_target(struct scsi_target *, u64); extern void starget_for_each_device(struct scsi_target *, void *, void (*fn)(struct scsi_device *, void *)); extern void __starget_for_each_device(struct scsi_target *, void *, void (*fn)(struct scsi_device *, void *)); /* only exposed to implement shost_for_each_device */ extern struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *, struct scsi_device *); /** * shost_for_each_device - iterate over all devices of a host * @sdev: the &struct scsi_device to use as a cursor * @shost: the &struct scsi_host to iterate over * * Iterator that returns each device attached to @shost. This loop * takes a reference on each device and releases it at the end. If * you break out of the loop, you must call scsi_device_put(sdev). */ #define shost_for_each_device(sdev, shost) \ for ((sdev) = __scsi_iterate_devices((shost), NULL); \ (sdev); \ (sdev) = __scsi_iterate_devices((shost), (sdev))) /** * __shost_for_each_device - iterate over all devices of a host (UNLOCKED) * @sdev: the &struct scsi_device to use as a cursor * @shost: the &struct scsi_host to iterate over * * Iterator that returns each device attached to @shost. It does _not_ * take a reference on the scsi_device, so the whole loop must be * protected by shost->host_lock. * * Note: The only reason to use this is because you need to access the * device list in interrupt context. Otherwise you really want to use * shost_for_each_device instead. */ #define __shost_for_each_device(sdev, shost) \ list_for_each_entry((sdev), &((shost)->__devices), siblings) extern int scsi_change_queue_depth(struct scsi_device *, int); extern int scsi_track_queue_full(struct scsi_device *, int); extern int scsi_set_medium_removal(struct scsi_device *, char); int scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage, int subpage, unsigned char *buffer, int len, int timeout, int retries, struct scsi_mode_data *data, struct scsi_sense_hdr *); extern int scsi_mode_select(struct scsi_device *sdev, int pf, int sp, unsigned char *buffer, int len, int timeout, int retries, struct scsi_mode_data *data, struct scsi_sense_hdr *); extern int scsi_test_unit_ready(struct scsi_device *sdev, int timeout, int retries, struct scsi_sense_hdr *sshdr); extern int scsi_get_vpd_page(struct scsi_device *, u8 page, unsigned char *buf, int buf_len); int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, unsigned int len, unsigned char opcode, unsigned short sa); extern int scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state); extern struct scsi_event *sdev_evt_alloc(enum scsi_device_event evt_type, gfp_t gfpflags); extern void sdev_evt_send(struct scsi_device *sdev, struct scsi_event *evt); extern void sdev_evt_send_simple(struct scsi_device *sdev, enum scsi_device_event evt_type, gfp_t gfpflags); extern int scsi_device_quiesce(struct scsi_device *sdev); extern void scsi_device_resume(struct scsi_device *sdev); extern void scsi_target_quiesce(struct scsi_target *); extern void scsi_target_resume(struct scsi_target *); extern void scsi_scan_target(struct device *parent, unsigned int channel, unsigned int id, u64 lun, enum scsi_scan_mode rescan); extern void scsi_target_reap(struct scsi_target *); void scsi_block_targets(struct Scsi_Host *shost, struct device *dev); extern void scsi_target_unblock(struct device *, enum scsi_device_state); extern void scsi_remove_target(struct device *); extern const char *scsi_device_state_name(enum scsi_device_state); extern int scsi_is_sdev_device(const struct device *); extern int scsi_is_target_device(const struct device *); extern void scsi_sanitize_inquiry_string(unsigned char *s, int len); /* * scsi_execute_cmd users can set scsi_failure.result to have * scsi_check_passthrough fail/retry a command. scsi_failure.result can be a * specific host byte or message code, or SCMD_FAILURE_RESULT_ANY can be used * to match any host or message code. */ #define SCMD_FAILURE_RESULT_ANY 0x7fffffff /* * Set scsi_failure.result to SCMD_FAILURE_STAT_ANY to fail/retry any failure * scsi_status_is_good returns false for. */ #define SCMD_FAILURE_STAT_ANY 0xff /* * The following can be set to the scsi_failure sense, asc and ascq fields to * match on any sense, ASC, or ASCQ value. */ #define SCMD_FAILURE_SENSE_ANY 0xff #define SCMD_FAILURE_ASC_ANY 0xff #define SCMD_FAILURE_ASCQ_ANY 0xff /* Always retry a matching failure. */ #define SCMD_FAILURE_NO_LIMIT -1 struct scsi_failure { int result; u8 sense; u8 asc; u8 ascq; /* * Number of times scsi_execute_cmd will retry the failure. It does * not count for the total_allowed. */ s8 allowed; /* Number of times the failure has been retried. */ s8 retries; }; struct scsi_failures { /* * If a scsi_failure does not have a retry limit setup this limit will * be used. */ int total_allowed; int total_retries; struct scsi_failure *failure_definitions; }; /* Optional arguments to scsi_execute_cmd */ struct scsi_exec_args { unsigned char *sense; /* sense buffer */ unsigned int sense_len; /* sense buffer len */ struct scsi_sense_hdr *sshdr; /* decoded sense header */ blk_mq_req_flags_t req_flags; /* BLK_MQ_REQ flags */ int scmd_flags; /* SCMD flags */ int *resid; /* residual length */ struct scsi_failures *failures; /* failures to retry */ }; int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd, blk_opf_t opf, void *buffer, unsigned int bufflen, int timeout, int retries, const struct scsi_exec_args *args); void scsi_failures_reset_retries(struct scsi_failures *failures); extern void sdev_disable_disk_events(struct scsi_device *sdev); extern void sdev_enable_disk_events(struct scsi_device *sdev); extern int scsi_vpd_lun_id(struct scsi_device *, char *, size_t); extern int scsi_vpd_tpg_id(struct scsi_device *, int *); #ifdef CONFIG_PM extern int scsi_autopm_get_device(struct scsi_device *); extern void scsi_autopm_put_device(struct scsi_device *); #else static inline int scsi_autopm_get_device(struct scsi_device *d) { return 0; } static inline void scsi_autopm_put_device(struct scsi_device *d) {} #endif /* CONFIG_PM */ static inline int __must_check scsi_device_reprobe(struct scsi_device *sdev) { return device_reprobe(&sdev->sdev_gendev); } static inline unsigned int sdev_channel(struct scsi_device *sdev) { return sdev->channel; } static inline unsigned int sdev_id(struct scsi_device *sdev) { return sdev->id; } #define scmd_id(scmd) sdev_id((scmd)->device) #define scmd_channel(scmd) sdev_channel((scmd)->device) /* * checks for positions of the SCSI state machine */ static inline int scsi_device_online(struct scsi_device *sdev) { return (sdev->sdev_state != SDEV_OFFLINE && sdev->sdev_state != SDEV_TRANSPORT_OFFLINE && sdev->sdev_state != SDEV_DEL); } static inline int scsi_device_blocked(struct scsi_device *sdev) { return sdev->sdev_state == SDEV_BLOCK || sdev->sdev_state == SDEV_CREATED_BLOCK; } static inline int scsi_device_created(struct scsi_device *sdev) { return sdev->sdev_state == SDEV_CREATED || sdev->sdev_state == SDEV_CREATED_BLOCK; } int scsi_internal_device_block_nowait(struct scsi_device *sdev); int scsi_internal_device_unblock_nowait(struct scsi_device *sdev, enum scsi_device_state new_state); /* accessor functions for the SCSI parameters */ static inline int scsi_device_sync(struct scsi_device *sdev) { return sdev->sdtr; } static inline int scsi_device_wide(struct scsi_device *sdev) { return sdev->wdtr; } static inline int scsi_device_dt(struct scsi_device *sdev) { return sdev->ppr; } static inline int scsi_device_dt_only(struct scsi_device *sdev) { if (sdev->inquiry_len < 57) return 0; return (sdev->inquiry[56] & 0x0c) == 0x04; } static inline int scsi_device_ius(struct scsi_device *sdev) { if (sdev->inquiry_len < 57) return 0; return sdev->inquiry[56] & 0x01; } static inline int scsi_device_qas(struct scsi_device *sdev) { if (sdev->inquiry_len < 57) return 0; return sdev->inquiry[56] & 0x02; } static inline int scsi_device_enclosure(struct scsi_device *sdev) { return sdev->inquiry ? (sdev->inquiry[6] & (1<<6)) : 1; } static inline int scsi_device_protection(struct scsi_device *sdev) { if (sdev->no_dif) return 0; return sdev->scsi_level > SCSI_2 && sdev->inquiry[5] & (1<<0); } static inline int scsi_device_tpgs(struct scsi_device *sdev) { return sdev->inquiry ? (sdev->inquiry[5] >> 4) & 0x3 : 0; } /** * scsi_device_supports_vpd - test if a device supports VPD pages * @sdev: the &struct scsi_device to test * * If the 'try_vpd_pages' flag is set it takes precedence. * Otherwise we will assume VPD pages are supported if the * SCSI level is at least SPC-3 and 'skip_vpd_pages' is not set. */ static inline int scsi_device_supports_vpd(struct scsi_device *sdev) { /* Attempt VPD inquiry if the device blacklist explicitly calls * for it. */ if (sdev->try_vpd_pages) return 1; /* * Although VPD inquiries can go to SCSI-2 type devices, * some USB ones crash on receiving them, and the pages * we currently ask for are mandatory for SPC-2 and beyond */ if (sdev->scsi_level >= SCSI_SPC_2 && !sdev->skip_vpd_pages) return 1; return 0; } static inline int scsi_device_busy(struct scsi_device *sdev) { return sbitmap_weight(&sdev->budget_map); } /* Macros to access the UNIT ATTENTION counters */ #define scsi_get_ua_new_media_ctr(sdev) \ ((const unsigned int)(sdev->ua_new_media_ctr)) #define scsi_get_ua_por_ctr(sdev) \ ((const unsigned int)(sdev->ua_por_ctr)) #define MODULE_ALIAS_SCSI_DEVICE(type) \ MODULE_ALIAS("scsi:t-" __stringify(type) "*") #define SCSI_DEVICE_MODALIAS_FMT "scsi:t-0x%02x" #endif /* _SCSI_SCSI_DEVICE_H */ |
| 15 15 15 15 15 15 15 15 15 15 15 15 15 6 5 5 4 1 1 1 1 4 5 6 51 51 2 2 1 1 51 1 49 335 9 2 7 2 5 4 4 330 35 35 34 35 35 34 20 34 35 35 15 15 15 15 1 1 15 136 33 35 35 35 15 15 35 14 15 136 2 1 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 | // SPDX-License-Identifier: GPL-2.0-only /* * Yama Linux Security Module * * Author: Kees Cook <keescook@chromium.org> * * Copyright (C) 2010 Canonical, Ltd. * Copyright (C) 2011 The Chromium OS Authors. */ #include <linux/lsm_hooks.h> #include <linux/sysctl.h> #include <linux/ptrace.h> #include <linux/prctl.h> #include <linux/ratelimit.h> #include <linux/workqueue.h> #include <linux/string_helpers.h> #include <linux/task_work.h> #include <linux/sched.h> #include <linux/spinlock.h> #include <uapi/linux/lsm.h> #define YAMA_SCOPE_DISABLED 0 #define YAMA_SCOPE_RELATIONAL 1 #define YAMA_SCOPE_CAPABILITY 2 #define YAMA_SCOPE_NO_ATTACH 3 static int ptrace_scope = YAMA_SCOPE_RELATIONAL; /* describe a ptrace relationship for potential exception */ struct ptrace_relation { struct task_struct *tracer; struct task_struct *tracee; bool invalid; struct list_head node; struct rcu_head rcu; }; static LIST_HEAD(ptracer_relations); static DEFINE_SPINLOCK(ptracer_relations_lock); static void yama_relation_cleanup(struct work_struct *work); static DECLARE_WORK(yama_relation_work, yama_relation_cleanup); struct access_report_info { struct callback_head work; const char *access; struct task_struct *target; struct task_struct *agent; }; static void __report_access(struct callback_head *work) { struct access_report_info *info = container_of(work, struct access_report_info, work); char *target_cmd, *agent_cmd; target_cmd = kstrdup_quotable_cmdline(info->target, GFP_KERNEL); agent_cmd = kstrdup_quotable_cmdline(info->agent, GFP_KERNEL); pr_notice_ratelimited( "ptrace %s of \"%s\"[%d] was attempted by \"%s\"[%d]\n", info->access, target_cmd, info->target->pid, agent_cmd, info->agent->pid); kfree(agent_cmd); kfree(target_cmd); put_task_struct(info->agent); put_task_struct(info->target); kfree(info); } /* defers execution because cmdline access can sleep */ static void report_access(const char *access, struct task_struct *target, struct task_struct *agent) { struct access_report_info *info; assert_spin_locked(&target->alloc_lock); /* for target->comm */ if (current->flags & PF_KTHREAD) { /* I don't think kthreads call task_work_run() before exiting. * Imagine angry ranting about procfs here. */ pr_notice_ratelimited( "ptrace %s of \"%s\"[%d] was attempted by \"%s\"[%d]\n", access, target->comm, target->pid, agent->comm, agent->pid); return; } info = kmalloc(sizeof(*info), GFP_ATOMIC); if (!info) return; init_task_work(&info->work, __report_access); get_task_struct(target); get_task_struct(agent); info->access = access; info->target = target; info->agent = agent; if (task_work_add(current, &info->work, TWA_RESUME) == 0) return; /* success */ WARN(1, "report_access called from exiting task"); put_task_struct(target); put_task_struct(agent); kfree(info); } /** * yama_relation_cleanup - remove invalid entries from the relation list * @work: unused * */ static void yama_relation_cleanup(struct work_struct *work) { struct ptrace_relation *relation; spin_lock(&ptracer_relations_lock); rcu_read_lock(); list_for_each_entry_rcu(relation, &ptracer_relations, node) { if (relation->invalid) { list_del_rcu(&relation->node); kfree_rcu(relation, rcu); } } rcu_read_unlock(); spin_unlock(&ptracer_relations_lock); } /** * yama_ptracer_add - add/replace an exception for this tracer/tracee pair * @tracer: the task_struct of the process doing the ptrace * @tracee: the task_struct of the process to be ptraced * * Each tracee can have, at most, one tracer registered. Each time this * is called, the prior registered tracer will be replaced for the tracee. * * Returns 0 if relationship was added, -ve on error. */ static int yama_ptracer_add(struct task_struct *tracer, struct task_struct *tracee) { struct ptrace_relation *relation, *added; added = kmalloc(sizeof(*added), GFP_KERNEL); if (!added) return -ENOMEM; added->tracee = tracee; added->tracer = tracer; added->invalid = false; spin_lock(&ptracer_relations_lock); rcu_read_lock(); list_for_each_entry_rcu(relation, &ptracer_relations, node) { if (relation->invalid) continue; if (relation->tracee == tracee) { list_replace_rcu(&relation->node, &added->node); kfree_rcu(relation, rcu); goto out; } } list_add_rcu(&added->node, &ptracer_relations); out: rcu_read_unlock(); spin_unlock(&ptracer_relations_lock); return 0; } /** * yama_ptracer_del - remove exceptions related to the given tasks * @tracer: remove any relation where tracer task matches * @tracee: remove any relation where tracee task matches */ static void yama_ptracer_del(struct task_struct *tracer, struct task_struct *tracee) { struct ptrace_relation *relation; bool marked = false; rcu_read_lock(); list_for_each_entry_rcu(relation, &ptracer_relations, node) { if (relation->invalid) continue; if (relation->tracee == tracee || (tracer && relation->tracer == tracer)) { relation->invalid = true; marked = true; } } rcu_read_unlock(); if (marked) schedule_work(&yama_relation_work); } /** * yama_task_free - check for task_pid to remove from exception list * @task: task being removed */ static void yama_task_free(struct task_struct *task) { yama_ptracer_del(task, task); } /** * yama_task_prctl - check for Yama-specific prctl operations * @option: operation * @arg2: argument * @arg3: argument * @arg4: argument * @arg5: argument * * Return 0 on success, -ve on error. -ENOSYS is returned when Yama * does not handle the given option. */ static int yama_task_prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5) { int rc = -ENOSYS; struct task_struct *myself; switch (option) { case PR_SET_PTRACER: /* Since a thread can call prctl(), find the group leader * before calling _add() or _del() on it, since we want * process-level granularity of control. The tracer group * leader checking is handled later when walking the ancestry * at the time of PTRACE_ATTACH check. */ myself = current->group_leader; if (arg2 == 0) { yama_ptracer_del(NULL, myself); rc = 0; } else if (arg2 == PR_SET_PTRACER_ANY || (int)arg2 == -1) { rc = yama_ptracer_add(NULL, myself); } else { struct task_struct *tracer; tracer = find_get_task_by_vpid(arg2); if (!tracer) { rc = -EINVAL; } else { rc = yama_ptracer_add(tracer, myself); put_task_struct(tracer); } } break; } return rc; } /** * task_is_descendant - walk up a process family tree looking for a match * @parent: the process to compare against while walking up from child * @child: the process to start from while looking upwards for parent * * Returns 1 if child is a descendant of parent, 0 if not. */ static int task_is_descendant(struct task_struct *parent, struct task_struct *child) { int rc = 0; struct task_struct *walker = child; if (!parent || !child) return 0; rcu_read_lock(); if (!thread_group_leader(parent)) parent = rcu_dereference(parent->group_leader); while (walker->pid > 0) { if (!thread_group_leader(walker)) walker = rcu_dereference(walker->group_leader); if (walker == parent) { rc = 1; break; } walker = rcu_dereference(walker->real_parent); } rcu_read_unlock(); return rc; } /** * ptracer_exception_found - tracer registered as exception for this tracee * @tracer: the task_struct of the process attempting ptrace * @tracee: the task_struct of the process to be ptraced * * Returns 1 if tracer has a ptracer exception ancestor for tracee. */ static int ptracer_exception_found(struct task_struct *tracer, struct task_struct *tracee) { int rc = 0; struct ptrace_relation *relation; struct task_struct *parent = NULL; bool found = false; rcu_read_lock(); /* * If there's already an active tracing relationship, then make an * exception for the sake of other accesses, like process_vm_rw(). */ parent = ptrace_parent(tracee); if (parent != NULL && same_thread_group(parent, tracer)) { rc = 1; goto unlock; } /* Look for a PR_SET_PTRACER relationship. */ if (!thread_group_leader(tracee)) tracee = rcu_dereference(tracee->group_leader); list_for_each_entry_rcu(relation, &ptracer_relations, node) { if (relation->invalid) continue; if (relation->tracee == tracee) { parent = relation->tracer; found = true; break; } } if (found && (parent == NULL || task_is_descendant(parent, tracer))) rc = 1; unlock: rcu_read_unlock(); return rc; } /** * yama_ptrace_access_check - validate PTRACE_ATTACH calls * @child: task that current task is attempting to ptrace * @mode: ptrace attach mode * * Returns 0 if following the ptrace is allowed, -ve on error. */ static int yama_ptrace_access_check(struct task_struct *child, unsigned int mode) { int rc = 0; /* require ptrace target be a child of ptracer on attach */ if (mode & PTRACE_MODE_ATTACH) { switch (ptrace_scope) { case YAMA_SCOPE_DISABLED: /* No additional restrictions. */ break; case YAMA_SCOPE_RELATIONAL: rcu_read_lock(); if (!pid_alive(child)) rc = -EPERM; if (!rc && !task_is_descendant(current, child) && !ptracer_exception_found(current, child) && !ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE)) rc = -EPERM; rcu_read_unlock(); break; case YAMA_SCOPE_CAPABILITY: rcu_read_lock(); if (!ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE)) rc = -EPERM; rcu_read_unlock(); break; case YAMA_SCOPE_NO_ATTACH: default: rc = -EPERM; break; } } if (rc && (mode & PTRACE_MODE_NOAUDIT) == 0) report_access("attach", child, current); return rc; } /** * yama_ptrace_traceme - validate PTRACE_TRACEME calls * @parent: task that will become the ptracer of the current task * * Returns 0 if following the ptrace is allowed, -ve on error. */ static int yama_ptrace_traceme(struct task_struct *parent) { int rc = 0; /* Only disallow PTRACE_TRACEME on more aggressive settings. */ switch (ptrace_scope) { case YAMA_SCOPE_CAPABILITY: if (!has_ns_capability(parent, current_user_ns(), CAP_SYS_PTRACE)) rc = -EPERM; break; case YAMA_SCOPE_NO_ATTACH: rc = -EPERM; break; } if (rc) { task_lock(current); report_access("traceme", current, parent); task_unlock(current); } return rc; } static const struct lsm_id yama_lsmid = { .name = "yama", .id = LSM_ID_YAMA, }; static struct security_hook_list yama_hooks[] __ro_after_init = { LSM_HOOK_INIT(ptrace_access_check, yama_ptrace_access_check), LSM_HOOK_INIT(ptrace_traceme, yama_ptrace_traceme), LSM_HOOK_INIT(task_prctl, yama_task_prctl), LSM_HOOK_INIT(task_free, yama_task_free), }; #ifdef CONFIG_SYSCTL static int yama_dointvec_minmax(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { struct ctl_table table_copy; if (write && !capable(CAP_SYS_PTRACE)) return -EPERM; /* Lock the max value if it ever gets set. */ table_copy = *table; if (*(int *)table_copy.data == *(int *)table_copy.extra2) table_copy.extra1 = table_copy.extra2; return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos); } static int max_scope = YAMA_SCOPE_NO_ATTACH; static const struct ctl_table yama_sysctl_table[] = { { .procname = "ptrace_scope", .data = &ptrace_scope, .maxlen = sizeof(int), .mode = 0644, .proc_handler = yama_dointvec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = &max_scope, }, }; static void __init yama_init_sysctl(void) { if (!register_sysctl("kernel/yama", yama_sysctl_table)) panic("Yama: sysctl registration failed.\n"); } #else static inline void yama_init_sysctl(void) { } #endif /* CONFIG_SYSCTL */ static int __init yama_init(void) { pr_info("Yama: becoming mindful.\n"); security_add_hooks(yama_hooks, ARRAY_SIZE(yama_hooks), &yama_lsmid); yama_init_sysctl(); return 0; } DEFINE_LSM(yama) = { .name = "yama", .init = yama_init, }; |
| 48 57 333 335 333 3 333 67 67 67 67 407 22 525 525 524 525 524 525 525 575 525 575 197 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 | // SPDX-License-Identifier: GPL-2.0-only /* -*- linux-c -*- * sysctl_net.c: sysctl interface to net subsystem. * * Begun April 1, 1996, Mike Shaver. * Added /proc/sys/net directories for each protocol family. [MS] * * Revision 1.2 1996/05/08 20:24:40 shaver * Added bits for NET_BRIDGE and the NET_IPV4_ARP stuff and * NET_IPV4_IP_FORWARD. * * */ #include <linux/mm.h> #include <linux/export.h> #include <linux/sysctl.h> #include <linux/nsproxy.h> #include <net/sock.h> #ifdef CONFIG_INET #include <net/ip.h> #endif #ifdef CONFIG_NET #include <linux/if_ether.h> #endif static struct ctl_table_set * net_ctl_header_lookup(struct ctl_table_root *root) { return ¤t->nsproxy->net_ns->sysctls; } static int is_seen(struct ctl_table_set *set) { return ¤t->nsproxy->net_ns->sysctls == set; } /* Return standard mode bits for table entry. */ static int net_ctl_permissions(struct ctl_table_header *head, const struct ctl_table *table) { struct net *net = container_of(head->set, struct net, sysctls); /* Allow network administrator to have same access as root. */ if (ns_capable_noaudit(net->user_ns, CAP_NET_ADMIN)) { int mode = (table->mode >> 6) & 7; return (mode << 6) | (mode << 3) | mode; } return table->mode; } static void net_ctl_set_ownership(struct ctl_table_header *head, kuid_t *uid, kgid_t *gid) { struct net *net = container_of(head->set, struct net, sysctls); kuid_t ns_root_uid; kgid_t ns_root_gid; ns_root_uid = make_kuid(net->user_ns, 0); if (uid_valid(ns_root_uid)) *uid = ns_root_uid; ns_root_gid = make_kgid(net->user_ns, 0); if (gid_valid(ns_root_gid)) *gid = ns_root_gid; } static struct ctl_table_root net_sysctl_root = { .lookup = net_ctl_header_lookup, .permissions = net_ctl_permissions, .set_ownership = net_ctl_set_ownership, }; static int __net_init sysctl_net_init(struct net *net) { setup_sysctl_set(&net->sysctls, &net_sysctl_root, is_seen); return 0; } static void __net_exit sysctl_net_exit(struct net *net) { retire_sysctl_set(&net->sysctls); } static struct pernet_operations sysctl_pernet_ops = { .init = sysctl_net_init, .exit = sysctl_net_exit, }; static struct ctl_table_header *net_header; __init int net_sysctl_init(void) { static struct ctl_table empty[1]; int ret = -ENOMEM; /* Avoid limitations in the sysctl implementation by * registering "/proc/sys/net" as an empty directory not in a * network namespace. */ net_header = register_sysctl_sz("net", empty, 0); if (!net_header) goto out; ret = register_pernet_subsys(&sysctl_pernet_ops); if (ret) goto out1; out: return ret; out1: unregister_sysctl_table(net_header); net_header = NULL; goto out; } /* Verify that sysctls for non-init netns are safe by either: * 1) being read-only, or * 2) having a data pointer which points outside of the global kernel/module * data segment, and rather into the heap where a per-net object was * allocated. */ static void ensure_safe_net_sysctl(struct net *net, const char *path, struct ctl_table *table, size_t table_size) { struct ctl_table *ent; pr_debug("Registering net sysctl (net %p): %s\n", net, path); ent = table; for (size_t i = 0; i < table_size; ent++, i++) { unsigned long addr; const char *where; pr_debug(" procname=%s mode=%o proc_handler=%ps data=%p\n", ent->procname, ent->mode, ent->proc_handler, ent->data); /* If it's not writable inside the netns, then it can't hurt. */ if ((ent->mode & 0222) == 0) { pr_debug(" Not writable by anyone\n"); continue; } /* Where does data point? */ addr = (unsigned long)ent->data; if (is_module_address(addr)) where = "module"; else if (is_kernel_core_data(addr)) where = "kernel"; else continue; /* If it is writable and points to kernel/module global * data, then it's probably a netns leak. */ WARN(1, "sysctl %s/%s: data points to %s global data: %ps\n", path, ent->procname, where, ent->data); /* Make it "safe" by dropping writable perms */ ent->mode &= ~0222; } } struct ctl_table_header *register_net_sysctl_sz(struct net *net, const char *path, struct ctl_table *table, size_t table_size) { if (!net_eq(net, &init_net)) ensure_safe_net_sysctl(net, path, table, table_size); return __register_sysctl_table(&net->sysctls, path, table, table_size); } EXPORT_SYMBOL_GPL(register_net_sysctl_sz); void unregister_net_sysctl_table(struct ctl_table_header *header) { unregister_sysctl_table(header); } EXPORT_SYMBOL_GPL(unregister_net_sysctl_table); |
| 3502 2421 771 832 832 834 771 791 834 817 752 576 640 1017 1015 1016 834 838 1017 990 924 15 84 84 84 1057 1057 1056 1017 1015 1056 1025 1026 125 120 121 121 1363 122 1362 1361 1101 198 1100 1056 1057 1057 1057 1056 1330 1262 1332 1330 1330 1310 1331 209 207 207 196 207 1332 1331 4914 4908 4911 1853 1849 27 1854 117 115 117 117 1304 1304 1304 1235 526 487 2713 2716 2716 2884 542 542 53 103 103 5 102 102 102 102 102 102 24 24 102 4 4 4 4 4 4 4 1 1 1 1 1 1 103 103 103 103 9 9 9 9 9 9 221 219 3 222 212 222 222 222 3 226 226 10 10 10 8 1 7 7 2 9 226 225 9 9 221 221 212 212 222 247 261 480 2217 2217 225 226 226 226 226 226 221 221 226 220 226 103 103 103 103 11 226 9 1 226 226 226 226 226 226 226 9 225 225 225 226 226 226 226 226 57 57 57 226 226 225 226 226 226 225 226 226 226 222 226 226 226 226 226 226 226 222 228 228 228 228 224 226 8 226 225 225 225 224 226 226 226 226 226 222 222 221 221 1052 1032 1042 85 8 5 1017 625 625 427 625 617 641 521 520 131 804 804 1040 1065 48 48 48 1042 1042 1040 131 146 1017 92 92 92 55 55 37 30 28 30 8 8 46 1121 641 574 641 635 104 1073 1067 1053 1120 1120 1120 1120 1121 10 1120 1120 8 1120 1120 428 427 1121 428 1120 474 1121 8 1121 5 1166 1165 1167 99 10 94 1160 33 1 1161 1121 1167 1335 1335 1334 1167 1165 1335 1348 1350 1350 1335 1335 1350 1350 1349 1350 1350 1350 1350 1349 1421 1420 1422 920 1422 1420 143 109 108 1347 1333 1331 1334 139 1332 140 1332 211 1332 121 121 121 3 121 121 121 121 3 3 3 107 802 801 801 803 803 803 777 780 2 2 2 2 31 31 31 2 31 31 31 31 31 31 31 2 31 31 30 31 31 6 6 6 6 6 25 25 25 18 25 25 25 25 25 25 25 25 25 26 26 26 25 25 25 25 26 6 6 6 6 6 6 13 14 13 13 13 13 13 14 13 13 14 13 13 14 14 14 13 14 14 14 14 13 14 13 1 1 14 14 14 14 14 13 14 1 14 14 1 14 6 14 14 14 14 14 1 1363 1333 1339 1335 35 35 1367 1363 1360 1364 1364 1339 1364 1369 1366 1369 1365 1370 1364 1336 1365 1370 1367 1364 1366 1370 1366 1366 1337 1339 35 1370 1364 1365 1370 1366 1366 1360 1368 1370 1363 1370 1370 1365 1369 1335 35 597 252 1 1 1 4411 74 74 74 74 74 1090 1093 1093 1093 1028 1081 1080 1080 1063 1093 109 109 109 109 109 109 109 3063 1111 1 2417 2415 1432 11 1431 11 482 482 252 480 482 482 482 482 253 481 482 482 479 252 3 247 247 481 3 3 3 478 479 3 480 251 480 9 481 479 257 481 252 252 252 481 67 67 67 2 65 67 68 70 68 67 6 67 67 1 67 2 1 67 68 18 18 5 5 5 3 3 3 3 3 3 3 3 259 18 250 249 7 249 43 43 8 43 5 38 557 556 556 331 548 68 71 264 261 43 41 43 2 557 42 252 248 482 96 33 33 22 33 33 33 28 282 94 282 257 259 560 561 466 466 196 6617 3096 6626 6 6 6 6 598 597 598 38 32 6 6 6 6 561 561 364 464 463 4 4 461 461 561 559 561 561 560 560 559 560 247 101 561 560 560 561 560 561 561 560 561 552 33 561 561 307 561 560 560 561 17 558 556 556 243 561 561 559 525 526 524 61 561 559 560 6 559 560 559 559 11 11 116 116 116 561 2362 1103 1954 1959 1957 1959 3087 3094 3087 1331 1327 1329 1320 1324 1958 1959 1959 1763 1955 1958 1952 1955 1959 1956 1953 1952 1953 3088 3080 3095 48 3990 3284 1090 3986 3881 3870 21 3993 190 154 158 190 190 189 154 153 154 15 154 154 154 4335 4347 4340 1069 3573 94 3502 1068 14 14 4332 4347 6116 4236 3873 1071 3875 3881 3871 100 3871 3202 75 3170 3167 3855 1128 1134 1131 1123 1116 3853 3870 3858 1108 1 3852 3841 3851 3859 2 2 2 2 2 2 3534 2483 3541 3536 3542 3603 3583 3580 3613 3542 3086 3193 3063 4059 3067 150 18 18 18 18 18 14 14 14 12 1069 1069 1059 1055 33 33 1058 44 163 1058 4386 6 5 5 4385 4552 1087 18 1146 4397 223 6605 566 567 55 55 55 55 1 1 1 7019 6985 4721 4392 4394 1243 3944 3935 6616 1243 585 756 757 758 659 558 161 161 279 279 7030 4411 7009 7020 7015 7029 7015 4441 7017 7031 7030 7013 4795 567 6926 4412 3 3 78 79 66 56 576 28 6995 7280 7283 7300 7300 7248 560 7134 506 6116 7299 7290 4 4 7283 3459 3463 7282 7284 7288 7300 7300 7280 836 7299 390 7014 7290 84 7285 835 7300 7288 370 369 369 370 368 366 4932 4921 4913 4916 79 4926 4924 61 60 61 61 59 59 5 2 1 4 59 59 8 59 59 29 59 59 59 59 46 15 15 15 19874 19889 19876 19874 15 377 368 375 376 15 5 1 5 377 27 28 2 366 368 366 368 15 15 15 15 5258 5258 5009 451 369 1757 72 1759 72 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300 5301 5302 5303 5304 5305 5306 5307 5308 5309 5310 5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 5434 5435 5436 5437 5438 5439 5440 5441 5442 5443 5444 5445 5446 5447 5448 5449 5450 5451 5452 5453 5454 5455 5456 5457 5458 5459 5460 5461 5462 5463 5464 5465 5466 5467 5468 5469 5470 5471 5472 5473 5474 5475 5476 5477 5478 5479 5480 5481 5482 5483 5484 5485 5486 5487 5488 5489 5490 5491 5492 5493 5494 5495 5496 5497 5498 5499 5500 5501 5502 5503 5504 5505 5506 5507 5508 5509 5510 5511 5512 5513 5514 5515 5516 5517 5518 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 5551 5552 5553 5554 5555 5556 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 5631 5632 5633 5634 5635 5636 5637 5638 5639 5640 5641 5642 5643 5644 5645 5646 5647 5648 5649 5650 5651 5652 5653 5654 5655 5656 5657 5658 5659 5660 5661 5662 5663 5664 5665 5666 5667 5668 5669 5670 5671 5672 5673 5674 5675 5676 5677 5678 5679 5680 5681 5682 5683 5684 5685 5686 5687 5688 5689 5690 5691 5692 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 5703 5704 5705 5706 5707 5708 5709 5710 5711 5712 5713 5714 5715 5716 5717 5718 5719 5720 5721 5722 5723 5724 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784 5785 5786 5787 5788 5789 5790 5791 5792 5793 5794 5795 5796 5797 5798 5799 5800 5801 5802 5803 5804 5805 5806 5807 5808 5809 5810 5811 5812 5813 5814 5815 5816 5817 5818 5819 5820 5821 5822 5823 5824 5825 5826 5827 5828 5829 5830 5831 5832 5833 5834 5835 5836 5837 5838 5839 5840 5841 5842 5843 5844 5845 5846 5847 5848 5849 5850 5851 5852 5853 5854 5855 5856 5857 5858 5859 5860 5861 5862 5863 5864 5865 5866 5867 5868 5869 5870 5871 5872 5873 5874 5875 5876 5877 5878 5879 5880 5881 5882 5883 5884 5885 5886 5887 5888 5889 5890 5891 5892 5893 5894 5895 5896 5897 5898 5899 5900 5901 5902 5903 5904 5905 5906 5907 5908 5909 5910 5911 5912 5913 5914 5915 5916 5917 5918 5919 5920 5921 5922 5923 5924 5925 5926 5927 5928 5929 5930 5931 5932 5933 5934 5935 5936 5937 5938 5939 5940 5941 5942 5943 5944 5945 5946 5947 5948 5949 5950 5951 5952 5953 5954 5955 5956 5957 5958 5959 5960 5961 5962 5963 5964 5965 5966 5967 5968 5969 5970 5971 5972 5973 5974 5975 5976 5977 5978 5979 5980 5981 5982 5983 5984 5985 5986 5987 5988 5989 5990 5991 5992 5993 5994 5995 5996 5997 5998 5999 6000 6001 6002 6003 6004 6005 6006 6007 6008 6009 6010 6011 6012 6013 6014 6015 6016 6017 6018 6019 6020 6021 6022 6023 6024 6025 6026 6027 6028 6029 6030 6031 6032 6033 6034 6035 6036 6037 6038 6039 6040 6041 6042 6043 6044 6045 6046 6047 6048 6049 6050 6051 6052 6053 6054 6055 6056 6057 6058 6059 6060 6061 6062 6063 6064 6065 6066 6067 6068 6069 6070 6071 6072 6073 6074 6075 6076 6077 6078 6079 6080 6081 6082 6083 6084 6085 6086 6087 6088 6089 6090 6091 6092 6093 6094 6095 6096 6097 6098 6099 6100 6101 6102 6103 6104 6105 6106 6107 6108 6109 6110 6111 6112 6113 6114 6115 6116 6117 6118 6119 6120 6121 6122 6123 6124 6125 6126 6127 6128 6129 6130 6131 6132 6133 6134 6135 6136 6137 6138 6139 6140 6141 6142 6143 6144 6145 6146 6147 6148 6149 6150 6151 6152 6153 6154 6155 6156 6157 6158 6159 6160 6161 6162 6163 6164 6165 6166 6167 6168 6169 6170 6171 6172 6173 6174 6175 6176 6177 6178 6179 6180 6181 6182 6183 6184 6185 6186 6187 6188 6189 6190 6191 6192 6193 6194 6195 6196 6197 6198 6199 6200 6201 6202 6203 6204 6205 6206 6207 6208 6209 6210 6211 6212 6213 6214 6215 6216 6217 6218 6219 6220 6221 6222 6223 6224 6225 6226 6227 6228 6229 6230 6231 6232 6233 6234 6235 6236 6237 6238 6239 6240 6241 6242 6243 6244 6245 6246 6247 6248 6249 6250 6251 6252 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 6263 6264 6265 6266 6267 6268 6269 6270 6271 6272 6273 6274 6275 6276 6277 6278 6279 6280 6281 6282 6283 6284 6285 6286 6287 6288 6289 6290 6291 6292 6293 6294 6295 6296 6297 6298 6299 6300 6301 6302 6303 6304 6305 6306 6307 6308 6309 6310 6311 6312 6313 6314 6315 6316 6317 6318 6319 6320 6321 6322 6323 6324 6325 6326 6327 6328 6329 6330 6331 6332 6333 6334 6335 6336 6337 6338 6339 6340 6341 6342 6343 6344 6345 6346 6347 6348 6349 6350 6351 6352 6353 6354 6355 6356 6357 6358 6359 6360 6361 6362 6363 6364 6365 6366 6367 6368 6369 6370 6371 6372 6373 6374 6375 6376 6377 6378 6379 6380 6381 6382 6383 6384 6385 6386 6387 6388 6389 6390 6391 6392 6393 6394 6395 6396 6397 6398 6399 6400 6401 6402 6403 6404 6405 6406 6407 6408 6409 6410 6411 6412 6413 6414 6415 6416 6417 6418 6419 6420 6421 6422 6423 6424 6425 6426 6427 6428 6429 6430 6431 6432 6433 6434 6435 6436 6437 6438 6439 6440 6441 6442 6443 6444 6445 6446 6447 6448 6449 6450 6451 6452 6453 6454 6455 6456 6457 6458 6459 6460 6461 6462 6463 6464 6465 6466 6467 6468 6469 6470 6471 6472 6473 6474 6475 6476 6477 6478 6479 6480 6481 6482 6483 6484 6485 6486 6487 6488 6489 6490 6491 6492 6493 6494 6495 6496 6497 6498 6499 6500 6501 6502 6503 6504 6505 6506 6507 6508 6509 6510 6511 6512 6513 6514 6515 6516 6517 6518 6519 6520 6521 6522 6523 6524 6525 6526 6527 6528 6529 6530 6531 6532 6533 6534 6535 6536 6537 6538 6539 6540 6541 6542 6543 6544 6545 6546 6547 6548 6549 6550 6551 6552 6553 6554 6555 6556 6557 6558 6559 6560 6561 6562 6563 6564 6565 6566 6567 6568 6569 6570 6571 6572 6573 6574 6575 6576 6577 6578 6579 6580 6581 6582 6583 6584 6585 6586 6587 6588 6589 6590 6591 6592 6593 6594 6595 6596 6597 6598 6599 6600 6601 6602 6603 6604 6605 6606 6607 6608 6609 6610 6611 6612 6613 6614 6615 6616 6617 6618 6619 6620 6621 6622 6623 6624 6625 6626 6627 6628 6629 6630 6631 6632 6633 6634 6635 6636 6637 6638 6639 6640 6641 6642 6643 6644 6645 6646 6647 6648 6649 6650 6651 6652 6653 6654 6655 6656 6657 6658 6659 6660 6661 6662 6663 6664 6665 6666 6667 6668 6669 6670 6671 6672 6673 6674 6675 6676 6677 6678 6679 6680 6681 6682 6683 6684 6685 6686 6687 6688 6689 6690 6691 6692 6693 6694 6695 6696 6697 6698 6699 6700 6701 6702 6703 6704 6705 6706 6707 6708 6709 6710 6711 6712 6713 6714 6715 6716 6717 6718 6719 6720 6721 6722 6723 6724 6725 6726 6727 6728 6729 6730 6731 6732 6733 6734 6735 6736 6737 6738 6739 6740 6741 6742 6743 6744 6745 6746 6747 6748 6749 6750 6751 6752 6753 6754 6755 6756 6757 6758 6759 6760 6761 6762 6763 6764 6765 6766 6767 6768 6769 6770 6771 6772 6773 6774 6775 6776 6777 6778 6779 6780 6781 6782 6783 6784 6785 6786 6787 6788 6789 6790 6791 6792 6793 6794 6795 6796 6797 6798 6799 6800 6801 6802 6803 6804 6805 6806 6807 6808 6809 6810 6811 6812 6813 6814 6815 6816 6817 6818 6819 6820 6821 6822 6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 6833 6834 6835 6836 6837 6838 6839 6840 6841 6842 6843 6844 6845 6846 6847 6848 6849 6850 6851 6852 6853 6854 6855 6856 6857 6858 6859 6860 6861 6862 6863 6864 6865 6866 6867 6868 6869 6870 6871 6872 6873 6874 6875 6876 6877 6878 6879 6880 6881 6882 6883 6884 6885 6886 6887 6888 6889 6890 6891 6892 6893 6894 6895 6896 6897 6898 6899 6900 6901 6902 6903 6904 6905 6906 6907 6908 6909 6910 6911 6912 6913 6914 6915 6916 6917 6918 6919 6920 6921 6922 6923 6924 6925 6926 6927 6928 6929 6930 6931 6932 6933 6934 6935 6936 6937 6938 6939 6940 6941 6942 6943 6944 6945 6946 6947 6948 6949 6950 6951 6952 6953 6954 6955 6956 6957 6958 6959 6960 6961 6962 6963 6964 6965 6966 6967 6968 6969 6970 6971 6972 6973 6974 6975 6976 6977 6978 6979 6980 6981 6982 6983 6984 6985 6986 6987 6988 6989 6990 6991 6992 6993 6994 6995 6996 6997 6998 6999 7000 7001 7002 7003 7004 7005 7006 7007 7008 7009 7010 7011 7012 7013 7014 7015 7016 7017 7018 7019 7020 7021 7022 7023 7024 7025 7026 7027 7028 7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042 7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056 7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070 7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170 7171 7172 7173 7174 7175 7176 7177 7178 7179 7180 7181 7182 7183 7184 7185 7186 7187 7188 7189 7190 7191 7192 7193 7194 7195 7196 7197 7198 7199 7200 7201 7202 7203 7204 7205 7206 7207 7208 7209 7210 7211 7212 7213 7214 7215 7216 7217 7218 7219 7220 7221 7222 7223 7224 7225 7226 7227 7228 7229 7230 7231 7232 7233 7234 7235 7236 7237 7238 7239 7240 7241 7242 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 7253 7254 7255 7256 7257 7258 7259 7260 7261 7262 7263 7264 7265 7266 7267 7268 7269 7270 7271 7272 7273 7274 7275 7276 7277 7278 7279 7280 7281 7282 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 7293 7294 7295 7296 7297 7298 7299 7300 7301 7302 7303 7304 7305 7306 7307 7308 | // SPDX-License-Identifier: GPL-2.0-only /* * linux/mm/memory.c * * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds */ /* * demand-loading started 01.12.91 - seems it is high on the list of * things wanted, and it should be easy to implement. - Linus */ /* * Ok, demand-loading was easy, shared pages a little bit tricker. Shared * pages started 02.12.91, seems to work. - Linus. * * Tested sharing by executing about 30 /bin/sh: under the old kernel it * would have taken more than the 6M I have free, but it worked well as * far as I could see. * * Also corrected some "invalidate()"s - I wasn't doing enough of them. */ /* * Real VM (paging to/from disk) started 18.12.91. Much more work and * thought has to go into this. Oh, well.. * 19.12.91 - works, somewhat. Sometimes I get faults, don't know why. * Found it. Everything seems to work now. * 20.12.91 - Ok, making the swap-device changeable like the root. */ /* * 05.04.94 - Multi-page memory management added for v1.1. * Idea by Alex Bligh (alex@cconcepts.co.uk) * * 16.07.99 - Support of BIGMEM added by Gerhard Wichert, Siemens AG * (Gerhard.Wichert@pdb.siemens.de) * * Aug/Sep 2004 Changed to four level page tables (Andi Kleen) */ #include <linux/kernel_stat.h> #include <linux/mm.h> #include <linux/mm_inline.h> #include <linux/sched/mm.h> #include <linux/sched/numa_balancing.h> #include <linux/sched/task.h> #include <linux/hugetlb.h> #include <linux/mman.h> #include <linux/swap.h> #include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/memremap.h> #include <linux/kmsan.h> #include <linux/ksm.h> #include <linux/rmap.h> #include <linux/export.h> #include <linux/delayacct.h> #include <linux/init.h> #include <linux/writeback.h> #include <linux/memcontrol.h> #include <linux/mmu_notifier.h> #include <linux/swapops.h> #include <linux/elf.h> #include <linux/gfp.h> #include <linux/migrate.h> #include <linux/string.h> #include <linux/memory-tiers.h> #include <linux/debugfs.h> #include <linux/userfaultfd_k.h> #include <linux/dax.h> #include <linux/oom.h> #include <linux/numa.h> #include <linux/perf_event.h> #include <linux/ptrace.h> #include <linux/vmalloc.h> #include <linux/sched/sysctl.h> #include <trace/events/kmem.h> #include <asm/io.h> #include <asm/mmu_context.h> #include <asm/pgalloc.h> #include <linux/uaccess.h> #include <asm/tlb.h> #include <asm/tlbflush.h> #include "pgalloc-track.h" #include "internal.h" #include "swap.h" #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST) #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. #endif static vm_fault_t do_fault(struct vm_fault *vmf); static vm_fault_t do_anonymous_page(struct vm_fault *vmf); static bool vmf_pte_changed(struct vm_fault *vmf); /* * Return true if the original pte was a uffd-wp pte marker (so the pte was * wr-protected). */ static __always_inline bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf) { if (!userfaultfd_wp(vmf->vma)) return false; if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)) return false; return pte_marker_uffd_wp(vmf->orig_pte); } /* * Randomize the address space (stacks, mmaps, brk, etc.). * * ( When CONFIG_COMPAT_BRK=y we exclude brk from randomization, * as ancient (libc5 based) binaries can segfault. ) */ int randomize_va_space __read_mostly = #ifdef CONFIG_COMPAT_BRK 1; #else 2; #endif static const struct ctl_table mmu_sysctl_table[] = { { .procname = "randomize_va_space", .data = &randomize_va_space, .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, }; static int __init init_mm_sysctl(void) { register_sysctl_init("kernel", mmu_sysctl_table); return 0; } subsys_initcall(init_mm_sysctl); #ifndef arch_wants_old_prefaulted_pte static inline bool arch_wants_old_prefaulted_pte(void) { /* * Transitioning a PTE from 'old' to 'young' can be expensive on * some architectures, even if it's performed in hardware. By * default, "false" means prefaulted entries will be 'young'. */ return false; } #endif static int __init disable_randmaps(char *s) { randomize_va_space = 0; return 1; } __setup("norandmaps", disable_randmaps); unsigned long zero_pfn __read_mostly; EXPORT_SYMBOL(zero_pfn); unsigned long highest_memmap_pfn __read_mostly; /* * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init() */ static int __init init_zero_pfn(void) { zero_pfn = page_to_pfn(ZERO_PAGE(0)); return 0; } early_initcall(init_zero_pfn); void mm_trace_rss_stat(struct mm_struct *mm, int member) { trace_rss_stat(mm, member); } /* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. */ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr) { pgtable_t token = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free_tlb(tlb, token, addr); mm_dec_nr_ptes(tlb->mm); } static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { pmd_t *pmd; unsigned long next; unsigned long start; start = addr; pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); if (pmd_none_or_clear_bad(pmd)) continue; free_pte_range(tlb, pmd, addr); } while (pmd++, addr = next, addr != end); start &= PUD_MASK; if (start < floor) return; if (ceiling) { ceiling &= PUD_MASK; if (!ceiling) return; } if (end - 1 > ceiling - 1) return; pmd = pmd_offset(pud, start); pud_clear(pud); pmd_free_tlb(tlb, pmd, start); mm_dec_nr_pmds(tlb->mm); } static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { pud_t *pud; unsigned long next; unsigned long start; start = addr; pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); if (pud_none_or_clear_bad(pud)) continue; free_pmd_range(tlb, pud, addr, next, floor, ceiling); } while (pud++, addr = next, addr != end); start &= P4D_MASK; if (start < floor) return; if (ceiling) { ceiling &= P4D_MASK; if (!ceiling) return; } if (end - 1 > ceiling - 1) return; pud = pud_offset(p4d, start); p4d_clear(p4d); pud_free_tlb(tlb, pud, start); mm_dec_nr_puds(tlb->mm); } static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { p4d_t *p4d; unsigned long next; unsigned long start; start = addr; p4d = p4d_offset(pgd, addr); do { next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(p4d)) continue; free_pud_range(tlb, p4d, addr, next, floor, ceiling); } while (p4d++, addr = next, addr != end); start &= PGDIR_MASK; if (start < floor) return; if (ceiling) { ceiling &= PGDIR_MASK; if (!ceiling) return; } if (end - 1 > ceiling - 1) return; p4d = p4d_offset(pgd, start); pgd_clear(pgd); p4d_free_tlb(tlb, p4d, start); } /** * free_pgd_range - Unmap and free page tables in the range * @tlb: the mmu_gather containing pending TLB flush info * @addr: virtual address start * @end: virtual address end * @floor: lowest address boundary * @ceiling: highest address boundary * * This function tears down all user-level page tables in the * specified virtual address range [@addr..@end). It is part of * the memory unmap flow. */ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { pgd_t *pgd; unsigned long next; /* * The next few lines have given us lots of grief... * * Why are we testing PMD* at this top level? Because often * there will be no work to do at all, and we'd prefer not to * go all the way down to the bottom just to discover that. * * Why all these "- 1"s? Because 0 represents both the bottom * of the address space and the top of it (using -1 for the * top wouldn't help much: the masks would do the wrong thing). * The rule is that addr 0 and floor 0 refer to the bottom of * the address space, but end 0 and ceiling 0 refer to the top * Comparisons need to use "end - 1" and "ceiling - 1" (though * that end 0 case should be mythical). * * Wherever addr is brought up or ceiling brought down, we must * be careful to reject "the opposite 0" before it confuses the * subsequent tests. But what about where end is brought down * by PMD_SIZE below? no, end can't go down to 0 there. * * Whereas we round start (addr) and ceiling down, by different * masks at different levels, in order to test whether a table * now has no other vmas using it, so can be freed, we don't * bother to round floor or end up - the tests don't need that. */ addr &= PMD_MASK; if (addr < floor) { addr += PMD_SIZE; if (!addr) return; } if (ceiling) { ceiling &= PMD_MASK; if (!ceiling) return; } if (end - 1 > ceiling - 1) end -= PMD_SIZE; if (addr > end - 1) return; /* * We add page table cache pages with PAGE_SIZE, * (see pte_free_tlb()), flush the tlb if we need */ tlb_change_page_size(tlb, PAGE_SIZE); pgd = pgd_offset(tlb->mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) continue; free_p4d_range(tlb, pgd, addr, next, floor, ceiling); } while (pgd++, addr = next, addr != end); } void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, struct vm_area_struct *vma, unsigned long floor, unsigned long ceiling, bool mm_wr_locked) { struct unlink_vma_file_batch vb; tlb_free_vmas(tlb); do { unsigned long addr = vma->vm_start; struct vm_area_struct *next; /* * Note: USER_PGTABLES_CEILING may be passed as ceiling and may * be 0. This will underflow and is okay. */ next = mas_find(mas, ceiling - 1); if (unlikely(xa_is_zero(next))) next = NULL; /* * Hide vma from rmap and truncate_pagecache before freeing * pgtables */ if (mm_wr_locked) vma_start_write(vma); unlink_anon_vmas(vma); unlink_file_vma_batch_init(&vb); unlink_file_vma_batch_add(&vb, vma); /* * Optimization: gather nearby vmas into one call down */ while (next && next->vm_start <= vma->vm_end + PMD_SIZE) { vma = next; next = mas_find(mas, ceiling - 1); if (unlikely(xa_is_zero(next))) next = NULL; if (mm_wr_locked) vma_start_write(vma); unlink_anon_vmas(vma); unlink_file_vma_batch_add(&vb, vma); } unlink_file_vma_batch_final(&vb); free_pgd_range(tlb, addr, vma->vm_end, floor, next ? next->vm_start : ceiling); vma = next; } while (vma); } void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte) { spinlock_t *ptl = pmd_lock(mm, pmd); if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ mm_inc_nr_ptes(mm); /* * Ensure all pte setup (eg. pte page lock and page clearing) are * visible before the pte is made visible to other CPUs by being * put into page tables. * * The other side of the story is the pointer chasing in the page * table walking code (when walking the page table without locking; * ie. most of the time). Fortunately, these data accesses consist * of a chain of data-dependent loads, meaning most CPUs (alpha * being the notable exception) will already guarantee loads are * seen in-order. See the alpha page table accessors for the * smp_rmb() barriers in page table walking code. */ smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */ pmd_populate(mm, pmd, *pte); *pte = NULL; } spin_unlock(ptl); } int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) { pgtable_t new = pte_alloc_one(mm); if (!new) return -ENOMEM; pmd_install(mm, pmd, &new); if (new) pte_free(mm, new); return 0; } int __pte_alloc_kernel(pmd_t *pmd) { pte_t *new = pte_alloc_one_kernel(&init_mm); if (!new) return -ENOMEM; spin_lock(&init_mm.page_table_lock); if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ smp_wmb(); /* See comment in pmd_install() */ pmd_populate_kernel(&init_mm, pmd, new); new = NULL; } spin_unlock(&init_mm.page_table_lock); if (new) pte_free_kernel(&init_mm, new); return 0; } static inline void init_rss_vec(int *rss) { memset(rss, 0, sizeof(int) * NR_MM_COUNTERS); } static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) { int i; for (i = 0; i < NR_MM_COUNTERS; i++) if (rss[i]) add_mm_counter(mm, i, rss[i]); } static bool is_bad_page_map_ratelimited(void) { static unsigned long resume; static unsigned long nr_shown; static unsigned long nr_unshown; /* * Allow a burst of 60 reports, then keep quiet for that minute; * or allow a steady drip of one report per second. */ if (nr_shown == 60) { if (time_before(jiffies, resume)) { nr_unshown++; return true; } if (nr_unshown) { pr_alert("BUG: Bad page map: %lu messages suppressed\n", nr_unshown); nr_unshown = 0; } nr_shown = 0; } if (nr_shown++ == 0) resume = jiffies + 60 * HZ; return false; } static void __print_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr) { unsigned long long pgdv, p4dv, pudv, pmdv; p4d_t p4d, *p4dp; pud_t pud, *pudp; pmd_t pmd, *pmdp; pgd_t *pgdp; /* * Although this looks like a fully lockless pgtable walk, it is not: * see locking requirements for print_bad_page_map(). */ pgdp = pgd_offset(mm, addr); pgdv = pgd_val(*pgdp); if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) { pr_alert("pgd:%08llx\n", pgdv); return; } p4dp = p4d_offset(pgdp, addr); p4d = p4dp_get(p4dp); p4dv = p4d_val(p4d); if (!p4d_present(p4d) || p4d_leaf(p4d)) { pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv); return; } pudp = pud_offset(p4dp, addr); pud = pudp_get(pudp); pudv = pud_val(pud); if (!pud_present(pud) || pud_leaf(pud)) { pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv); return; } pmdp = pmd_offset(pudp, addr); pmd = pmdp_get(pmdp); pmdv = pmd_val(pmd); /* * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE, * because the table should already be mapped by the caller and * doing another map would be bad. print_bad_page_map() should * already take care of printing the PTE. */ pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv, p4dv, pudv, pmdv); } /* * This function is called to print an error when a bad page table entry (e.g., * corrupted page table entry) is found. For example, we might have a * PFN-mapped pte in a region that doesn't allow it. * * The calling function must still handle the error. * * This function must be called during a proper page table walk, as it will * re-walk the page table to dump information: the caller MUST prevent page * table teardown (by holding mmap, vma or rmap lock) and MUST hold the leaf * page table lock. */ static void print_bad_page_map(struct vm_area_struct *vma, unsigned long addr, unsigned long long entry, struct page *page, enum pgtable_level level) { struct address_space *mapping; pgoff_t index; if (is_bad_page_map_ratelimited()) return; mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL; index = linear_page_index(vma, addr); pr_alert("BUG: Bad page map in process %s %s:%08llx", current->comm, pgtable_level_to_str(level), entry); __print_bad_page_map_pgtable(vma->vm_mm, addr); if (page) dump_page(page, "bad page map"); pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n", (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index); pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n", vma->vm_file, vma->vm_ops ? vma->vm_ops->fault : NULL, vma->vm_file ? vma->vm_file->f_op->mmap : NULL, vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL, mapping ? mapping->a_ops->read_folio : NULL); dump_stack(); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); } #define print_bad_pte(vma, addr, pte, page) \ print_bad_page_map(vma, addr, pte_val(pte), page, PGTABLE_LEVEL_PTE) /** * __vm_normal_page() - Get the "struct page" associated with a page table entry. * @vma: The VMA mapping the page table entry. * @addr: The address where the page table entry is mapped. * @pfn: The PFN stored in the page table entry. * @special: Whether the page table entry is marked "special". * @level: The page table level for error reporting purposes only. * @entry: The page table entry value for error reporting purposes only. * * "Special" mappings do not wish to be associated with a "struct page" (either * it doesn't exist, or it exists but they don't want to touch it). In this * case, NULL is returned here. "Normal" mappings do have a struct page and * are ordinarily refcounted. * * Page mappings of the shared zero folios are always considered "special", as * they are not ordinarily refcounted: neither the refcount nor the mapcount * of these folios is adjusted when mapping them into user page tables. * Selected page table walkers (such as GUP) can still identify mappings of the * shared zero folios and work with the underlying "struct page". * * There are 2 broad cases. Firstly, an architecture may define a "special" * page table entry bit, such as pte_special(), in which case this function is * trivial. Secondly, an architecture may not have a spare page table * entry bit, which requires a more complicated scheme, described below. * * With CONFIG_FIND_NORMAL_PAGE, we might have the "special" bit set on * page table entries that actually map "normal" pages: however, that page * cannot be looked up through the PFN stored in the page table entry, but * instead will be looked up through vm_ops->find_normal_page(). So far, this * only applies to PTEs. * * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a * special mapping (even if there are underlying and valid "struct pages"). * COWed pages of a VM_PFNMAP are always normal. * * The way we recognize COWed pages within VM_PFNMAP mappings is through the * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit * set, and the vm_pgoff will point to the first PFN mapped: thus every special * mapping will always honor the rule * * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT) * * And for normal mappings this is false. * * This restricts such mappings to be a linear translation from virtual address * to pfn. To get around this restriction, we allow arbitrary mappings so long * as the vma is not a COW mapping; in that case, we know that all ptes are * special (because none can have been COWed). * * * In order to support COW of arbitrary special mappings, we have VM_MIXEDMAP. * * VM_MIXEDMAP mappings can likewise contain memory with or without "struct * page" backing, however the difference is that _all_ pages with a struct * page (that is, those where pfn_valid is true, except the shared zero * folios) are refcounted and considered normal pages by the VM. * * The disadvantage is that pages are refcounted (which can be slower and * simply not an option for some PFNMAP users). The advantage is that we * don't have to follow the strict linearity rule of PFNMAP mappings in * order to support COWable mappings. * * Return: Returns the "struct page" if this is a "normal" mapping. Returns * NULL if this is a "special" mapping. */ static inline struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, bool special, unsigned long long entry, enum pgtable_level level) { if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) { if (unlikely(special)) { #ifdef CONFIG_FIND_NORMAL_PAGE if (vma->vm_ops && vma->vm_ops->find_normal_page) return vma->vm_ops->find_normal_page(vma, addr); #endif /* CONFIG_FIND_NORMAL_PAGE */ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) return NULL; if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)) return NULL; print_bad_page_map(vma, addr, entry, NULL, level); return NULL; } /* * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table * mappings (incl. shared zero folios) are marked accordingly. */ } else { if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { /* If it has a "struct page", it's "normal". */ if (!pfn_valid(pfn)) return NULL; } else { unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT; /* Only CoW'ed anon folios are "normal". */ if (pfn == vma->vm_pgoff + off) return NULL; if (!is_cow_mapping(vma->vm_flags)) return NULL; } } if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)) return NULL; } if (unlikely(pfn > highest_memmap_pfn)) { /* Corrupted page table entry. */ print_bad_page_map(vma, addr, entry, NULL, level); return NULL; } /* * NOTE! We still have PageReserved() pages in the page tables. * For example, VDSO mappings can cause them to exist. */ VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)); return pfn_to_page(pfn); } /** * vm_normal_page() - Get the "struct page" associated with a PTE * @vma: The VMA mapping the @pte. * @addr: The address where the @pte is mapped. * @pte: The PTE. * * Get the "struct page" associated with a PTE. See __vm_normal_page() * for details on "normal" and "special" mappings. * * Return: Returns the "struct page" if this is a "normal" mapping. Returns * NULL if this is a "special" mapping. */ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte) { return __vm_normal_page(vma, addr, pte_pfn(pte), pte_special(pte), pte_val(pte), PGTABLE_LEVEL_PTE); } /** * vm_normal_folio() - Get the "struct folio" associated with a PTE * @vma: The VMA mapping the @pte. * @addr: The address where the @pte is mapped. * @pte: The PTE. * * Get the "struct folio" associated with a PTE. See __vm_normal_page() * for details on "normal" and "special" mappings. * * Return: Returns the "struct folio" if this is a "normal" mapping. Returns * NULL if this is a "special" mapping. */ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr, pte_t pte) { struct page *page = vm_normal_page(vma, addr, pte); if (page) return page_folio(page); return NULL; } #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES /** * vm_normal_page_pmd() - Get the "struct page" associated with a PMD * @vma: The VMA mapping the @pmd. * @addr: The address where the @pmd is mapped. * @pmd: The PMD. * * Get the "struct page" associated with a PTE. See __vm_normal_page() * for details on "normal" and "special" mappings. * * Return: Returns the "struct page" if this is a "normal" mapping. Returns * NULL if this is a "special" mapping. */ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t pmd) { return __vm_normal_page(vma, addr, pmd_pfn(pmd), pmd_special(pmd), pmd_val(pmd), PGTABLE_LEVEL_PMD); } /** * vm_normal_folio_pmd() - Get the "struct folio" associated with a PMD * @vma: The VMA mapping the @pmd. * @addr: The address where the @pmd is mapped. * @pmd: The PMD. * * Get the "struct folio" associated with a PTE. See __vm_normal_page() * for details on "normal" and "special" mappings. * * Return: Returns the "struct folio" if this is a "normal" mapping. Returns * NULL if this is a "special" mapping. */ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t pmd) { struct page *page = vm_normal_page_pmd(vma, addr, pmd); if (page) return page_folio(page); return NULL; } /** * vm_normal_page_pud() - Get the "struct page" associated with a PUD * @vma: The VMA mapping the @pud. * @addr: The address where the @pud is mapped. * @pud: The PUD. * * Get the "struct page" associated with a PUD. See __vm_normal_page() * for details on "normal" and "special" mappings. * * Return: Returns the "struct page" if this is a "normal" mapping. Returns * NULL if this is a "special" mapping. */ struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr, pud_t pud) { return __vm_normal_page(vma, addr, pud_pfn(pud), pud_special(pud), pud_val(pud), PGTABLE_LEVEL_PUD); } #endif /** * restore_exclusive_pte - Restore a device-exclusive entry * @vma: VMA covering @address * @folio: the mapped folio * @page: the mapped folio page * @address: the virtual address * @ptep: pte pointer into the locked page table mapping the folio page * @orig_pte: pte value at @ptep * * Restore a device-exclusive non-swap entry to an ordinary present pte. * * The folio and the page table must be locked, and MMU notifiers must have * been called to invalidate any (exclusive) device mappings. * * Locking the folio makes sure that anybody who just converted the pte to * a device-exclusive entry can map it into the device to make forward * progress without others converting it back until the folio was unlocked. * * If the folio lock ever becomes an issue, we can stop relying on the folio * lock; it might make some scenarios with heavy thrashing less likely to * make forward progress, but these scenarios might not be valid use cases. * * Note that the folio lock does not protect against all cases of concurrent * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers * must use MMU notifiers to sync against any concurrent changes. */ static void restore_exclusive_pte(struct vm_area_struct *vma, struct folio *folio, struct page *page, unsigned long address, pte_t *ptep, pte_t orig_pte) { pte_t pte; VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot))); if (pte_swp_soft_dirty(orig_pte)) pte = pte_mksoft_dirty(pte); if (pte_swp_uffd_wp(orig_pte)) pte = pte_mkuffd_wp(pte); if ((vma->vm_flags & VM_WRITE) && can_change_pte_writable(vma, address, pte)) { if (folio_test_dirty(folio)) pte = pte_mkdirty(pte); pte = pte_mkwrite(pte, vma); } set_pte_at(vma->vm_mm, address, ptep, pte); /* * No need to invalidate - it was non-present before. However * secondary CPUs may have mappings that need invalidating. */ update_mmu_cache(vma, address, ptep); } /* * Tries to restore an exclusive pte if the page lock can be acquired without * sleeping. */ static int try_restore_exclusive_pte(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t orig_pte) { struct page *page = pfn_swap_entry_to_page(pte_to_swp_entry(orig_pte)); struct folio *folio = page_folio(page); if (folio_trylock(folio)) { restore_exclusive_pte(vma, folio, page, addr, ptep, orig_pte); folio_unlock(folio); return 0; } return -EBUSY; } /* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range * covered by this vma. */ static unsigned long copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, int *rss) { vm_flags_t vm_flags = dst_vma->vm_flags; pte_t orig_pte = ptep_get(src_pte); pte_t pte = orig_pte; struct folio *folio; struct page *page; swp_entry_t entry = pte_to_swp_entry(orig_pte); if (likely(!non_swap_entry(entry))) { if (swap_duplicate(entry) < 0) return -EIO; /* make sure dst_mm is on swapoff's mmlist. */ if (unlikely(list_empty(&dst_mm->mmlist))) { spin_lock(&mmlist_lock); if (list_empty(&dst_mm->mmlist)) list_add(&dst_mm->mmlist, &src_mm->mmlist); spin_unlock(&mmlist_lock); } /* Mark the swap entry as shared. */ if (pte_swp_exclusive(orig_pte)) { pte = pte_swp_clear_exclusive(orig_pte); set_pte_at(src_mm, addr, src_pte, pte); } rss[MM_SWAPENTS]++; } else if (is_migration_entry(entry)) { folio = pfn_swap_entry_folio(entry); rss[mm_counter(folio)]++; if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) { /* * COW mappings require pages in both parent and child * to be set to read. A previously exclusive entry is * now shared. */ entry = make_readable_migration_entry( swp_offset(entry)); pte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(orig_pte)) pte = pte_swp_mksoft_dirty(pte); if (pte_swp_uffd_wp(orig_pte)) pte = pte_swp_mkuffd_wp(pte); set_pte_at(src_mm, addr, src_pte, pte); } } else if (is_device_private_entry(entry)) { page = pfn_swap_entry_to_page(entry); folio = page_folio(page); /* * Update rss count even for unaddressable pages, as * they should treated just like normal pages in this * respect. * * We will likely want to have some new rss counters * for unaddressable pages, at some point. But for now * keep things as they are. */ folio_get(folio); rss[mm_counter(folio)]++; /* Cannot fail as these pages cannot get pinned. */ folio_try_dup_anon_rmap_pte(folio, page, dst_vma, src_vma); /* * We do not preserve soft-dirty information, because so * far, checkpoint/restore is the only feature that * requires that. And checkpoint/restore does not work * when a device driver is involved (you cannot easily * save and restore device driver state). */ if (is_writable_device_private_entry(entry) && is_cow_mapping(vm_flags)) { entry = make_readable_device_private_entry( swp_offset(entry)); pte = swp_entry_to_pte(entry); if (pte_swp_uffd_wp(orig_pte)) pte = pte_swp_mkuffd_wp(pte); set_pte_at(src_mm, addr, src_pte, pte); } } else if (is_device_exclusive_entry(entry)) { /* * Make device exclusive entries present by restoring the * original entry then copying as for a present pte. Device * exclusive entries currently only support private writable * (ie. COW) mappings. */ VM_BUG_ON(!is_cow_mapping(src_vma->vm_flags)); if (try_restore_exclusive_pte(src_vma, addr, src_pte, orig_pte)) return -EBUSY; return -ENOENT; } else if (is_pte_marker_entry(entry)) { pte_marker marker = copy_pte_marker(entry, dst_vma); if (marker) set_pte_at(dst_mm, addr, dst_pte, make_pte_marker(marker)); return 0; } if (!userfaultfd_wp(dst_vma)) pte = pte_swp_clear_uffd_wp(pte); set_pte_at(dst_mm, addr, dst_pte, pte); return 0; } /* * Copy a present and normal page. * * NOTE! The usual case is that this isn't required; * instead, the caller can just increase the page refcount * and re-use the pte the traditional way. * * And if we need a pre-allocated page but don't yet have * one, return a negative error to let the preallocation * code know so that it can do so outside the page table * lock. */ static inline int copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, struct folio **prealloc, struct page *page) { struct folio *new_folio; pte_t pte; new_folio = *prealloc; if (!new_folio) return -EAGAIN; /* * We have a prealloc page, all good! Take it * over and copy the page & arm it. */ if (copy_mc_user_highpage(&new_folio->page, page, addr, src_vma)) return -EHWPOISON; *prealloc = NULL; __folio_mark_uptodate(new_folio); folio_add_new_anon_rmap(new_folio, dst_vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(new_folio, dst_vma); rss[MM_ANONPAGES]++; /* All done, just insert the new page copy in the child */ pte = folio_mk_pte(new_folio, dst_vma->vm_page_prot); pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma); if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte))) /* Uffd-wp needs to be delivered to dest pte as well */ pte = pte_mkuffd_wp(pte); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); return 0; } static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr, int nr) { struct mm_struct *src_mm = src_vma->vm_mm; /* If it's a COW mapping, write protect it both processes. */ if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { wrprotect_ptes(src_mm, addr, src_pte, nr); pte = pte_wrprotect(pte); } /* If it's a shared mapping, mark it clean in the child. */ if (src_vma->vm_flags & VM_SHARED) pte = pte_mkclean(pte); pte = pte_mkold(pte); if (!userfaultfd_wp(dst_vma)) pte = pte_clear_uffd_wp(pte); set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); } /* * Copy one present PTE, trying to batch-process subsequent PTEs that map * consecutive pages of the same folio by copying them as well. * * Returns -EAGAIN if one preallocated page is required to copy the next PTE. * Otherwise, returns the number of copied PTEs (at least 1). */ static inline int copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr, int max_nr, int *rss, struct folio **prealloc) { fpb_t flags = FPB_MERGE_WRITE; struct page *page; struct folio *folio; int err, nr; page = vm_normal_page(src_vma, addr, pte); if (unlikely(!page)) goto copy_pte; folio = page_folio(page); /* * If we likely have to copy, just don't bother with batching. Make * sure that the common "small folio" case is as fast as possible * by keeping the batching logic separate. */ if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) { if (!(src_vma->vm_flags & VM_SHARED)) flags |= FPB_RESPECT_DIRTY; if (vma_soft_dirty_enabled(src_vma)) flags |= FPB_RESPECT_SOFT_DIRTY; nr = folio_pte_batch_flags(folio, src_vma, src_pte, &pte, max_nr, flags); folio_ref_add(folio, nr); if (folio_test_anon(folio)) { if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr, dst_vma, src_vma))) { folio_ref_sub(folio, nr); return -EAGAIN; } rss[MM_ANONPAGES] += nr; VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio); } else { folio_dup_file_rmap_ptes(folio, page, nr, dst_vma); rss[mm_counter_file(folio)] += nr; } __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, nr); return nr; } folio_get(folio); if (folio_test_anon(folio)) { /* * If this page may have been pinned by the parent process, * copy the page immediately for the child so that we'll always * guarantee the pinned page won't be randomly replaced in the * future. */ if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, dst_vma, src_vma))) { /* Page may be pinned, we have to copy. */ folio_put(folio); err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte, addr, rss, prealloc, page); return err ? err : 1; } rss[MM_ANONPAGES]++; VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio); } else { folio_dup_file_rmap_pte(folio, page, dst_vma); rss[mm_counter_file(folio)]++; } copy_pte: __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, 1); return 1; } static inline struct folio *folio_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma, unsigned long addr, bool need_zero) { struct folio *new_folio; if (need_zero) new_folio = vma_alloc_zeroed_movable_folio(vma, addr); else new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); if (!new_folio) return NULL; if (mem_cgroup_charge(new_folio, src_mm, GFP_KERNEL)) { folio_put(new_folio); return NULL; } folio_throttle_swaprate(new_folio, GFP_KERNEL); return new_folio; } static int copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, unsigned long end) { struct mm_struct *dst_mm = dst_vma->vm_mm; struct mm_struct *src_mm = src_vma->vm_mm; pte_t *orig_src_pte, *orig_dst_pte; pte_t *src_pte, *dst_pte; pmd_t dummy_pmdval; pte_t ptent; spinlock_t *src_ptl, *dst_ptl; int progress, max_nr, ret = 0; int rss[NR_MM_COUNTERS]; swp_entry_t entry = (swp_entry_t){0}; struct folio *prealloc = NULL; int nr; again: progress = 0; init_rss_vec(rss); /* * copy_pmd_range()'s prior pmd_none_or_clear_bad(src_pmd), and the * error handling here, assume that exclusive mmap_lock on dst and src * protects anon from unexpected THP transitions; with shmem and file * protected by mmap_lock-less collapse skipping areas with anon_vma * (whereas vma_needs_copy() skips areas without anon_vma). A rework * can remove such assumptions later, but this is good enough for now. */ dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl); if (!dst_pte) { ret = -ENOMEM; goto out; } /* * We already hold the exclusive mmap_lock, the copy_pte_range() and * retract_page_tables() are using vma->anon_vma to be exclusive, so * the PTE page is stable, and there is no need to get pmdval and do * pmd_same() check. */ src_pte = pte_offset_map_rw_nolock(src_mm, src_pmd, addr, &dummy_pmdval, &src_ptl); if (!src_pte) { pte_unmap_unlock(dst_pte, dst_ptl); /* ret == 0 */ goto out; } spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); orig_src_pte = src_pte; orig_dst_pte = dst_pte; arch_enter_lazy_mmu_mode(); do { nr = 1; /* * We are holding two locks at this point - either of them * could generate latencies in another task on another CPU. */ if (progress >= 32) { progress = 0; if (need_resched() || spin_needbreak(src_ptl) || spin_needbreak(dst_ptl)) break; } ptent = ptep_get(src_pte); if (pte_none(ptent)) { progress++; continue; } if (unlikely(!pte_present(ptent))) { ret = copy_nonpresent_pte(dst_mm, src_mm, dst_pte, src_pte, dst_vma, src_vma, addr, rss); if (ret == -EIO) { entry = pte_to_swp_entry(ptep_get(src_pte)); break; } else if (ret == -EBUSY) { break; } else if (!ret) { progress += 8; continue; } ptent = ptep_get(src_pte); VM_WARN_ON_ONCE(!pte_present(ptent)); /* * Device exclusive entry restored, continue by copying * the now present pte. */ WARN_ON_ONCE(ret != -ENOENT); } /* copy_present_ptes() will clear `*prealloc' if consumed */ max_nr = (end - addr) / PAGE_SIZE; ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, ptent, addr, max_nr, rss, &prealloc); /* * If we need a pre-allocated page for this pte, drop the * locks, allocate, and try again. * If copy failed due to hwpoison in source page, break out. */ if (unlikely(ret == -EAGAIN || ret == -EHWPOISON)) break; if (unlikely(prealloc)) { /* * pre-alloc page cannot be reused by next time so as * to strictly follow mempolicy (e.g., alloc_page_vma() * will allocate page according to address). This * could only happen if one pinned pte changed. */ folio_put(prealloc); prealloc = NULL; } nr = ret; progress += 8 * nr; } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_src_pte, src_ptl); add_mm_rss_vec(dst_mm, rss); pte_unmap_unlock(orig_dst_pte, dst_ptl); cond_resched(); if (ret == -EIO) { VM_WARN_ON_ONCE(!entry.val); if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) { ret = -ENOMEM; goto out; } entry.val = 0; } else if (ret == -EBUSY || unlikely(ret == -EHWPOISON)) { goto out; } else if (ret == -EAGAIN) { prealloc = folio_prealloc(src_mm, src_vma, addr, false); if (!prealloc) return -ENOMEM; } else if (ret < 0) { VM_WARN_ON_ONCE(1); } /* We've captured and resolved the error. Reset, try again. */ ret = 0; if (addr != end) goto again; out: if (unlikely(prealloc)) folio_put(prealloc); return ret; } static inline int copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pud_t *dst_pud, pud_t *src_pud, unsigned long addr, unsigned long end) { struct mm_struct *dst_mm = dst_vma->vm_mm; struct mm_struct *src_mm = src_vma->vm_mm; pmd_t *src_pmd, *dst_pmd; unsigned long next; dst_pmd = pmd_alloc(dst_mm, dst_pud, addr); if (!dst_pmd) return -ENOMEM; src_pmd = pmd_offset(src_pud, addr); do { next = pmd_addr_end(addr, end); if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr, dst_vma, src_vma); if (err == -ENOMEM) return -ENOMEM; if (!err) continue; /* fall through */ } if (pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd, addr, next)) return -ENOMEM; } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } static inline int copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, p4d_t *dst_p4d, p4d_t *src_p4d, unsigned long addr, unsigned long end) { struct mm_struct *dst_mm = dst_vma->vm_mm; struct mm_struct *src_mm = src_vma->vm_mm; pud_t *src_pud, *dst_pud; unsigned long next; dst_pud = pud_alloc(dst_mm, dst_p4d, addr); if (!dst_pud) return -ENOMEM; src_pud = pud_offset(src_p4d, addr); do { next = pud_addr_end(addr, end); if (pud_trans_huge(*src_pud)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma); err = copy_huge_pud(dst_mm, src_mm, dst_pud, src_pud, addr, src_vma); if (err == -ENOMEM) return -ENOMEM; if (!err) continue; /* fall through */ } if (pud_none_or_clear_bad(src_pud)) continue; if (copy_pmd_range(dst_vma, src_vma, dst_pud, src_pud, addr, next)) return -ENOMEM; } while (dst_pud++, src_pud++, addr = next, addr != end); return 0; } static inline int copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pgd_t *dst_pgd, pgd_t *src_pgd, unsigned long addr, unsigned long end) { struct mm_struct *dst_mm = dst_vma->vm_mm; p4d_t *src_p4d, *dst_p4d; unsigned long next; dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr); if (!dst_p4d) return -ENOMEM; src_p4d = p4d_offset(src_pgd, addr); do { next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(src_p4d)) continue; if (copy_pud_range(dst_vma, src_vma, dst_p4d, src_p4d, addr, next)) return -ENOMEM; } while (dst_p4d++, src_p4d++, addr = next, addr != end); return 0; } /* * Return true if the vma needs to copy the pgtable during this fork(). Return * false when we can speed up fork() by allowing lazy page faults later until * when the child accesses the memory range. */ static bool vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { /* * Always copy pgtables when dst_vma has uffd-wp enabled even if it's * file-backed (e.g. shmem). Because when uffd-wp is enabled, pgtable * contains uffd-wp protection information, that's something we can't * retrieve from page cache, and skip copying will lose those info. */ if (userfaultfd_wp(dst_vma)) return true; if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) return true; if (src_vma->anon_vma) return true; /* * Don't copy ptes where a page fault will fill them correctly. Fork * becomes much lighter when there are big shared or private readonly * mappings. The tradeoff is that copy_page_range is more efficient * than faulting. */ return false; } int copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { pgd_t *src_pgd, *dst_pgd; unsigned long addr = src_vma->vm_start; unsigned long end = src_vma->vm_end; struct mm_struct *dst_mm = dst_vma->vm_mm; struct mm_struct *src_mm = src_vma->vm_mm; struct mmu_notifier_range range; unsigned long next; bool is_cow; int ret; if (!vma_needs_copy(dst_vma, src_vma)) return 0; if (is_vm_hugetlb_page(src_vma)) return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma); /* * We need to invalidate the secondary MMU mappings only when * there could be a permission downgrade on the ptes of the * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. */ is_cow = is_cow_mapping(src_vma->vm_flags); if (is_cow) { mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, 0, src_mm, addr, end); mmu_notifier_invalidate_range_start(&range); /* * Disabling preemption is not needed for the write side, as * the read side doesn't spin, but goes to the mmap_lock. * * Use the raw variant of the seqcount_t write API to avoid * lockdep complaining about preemptibility. */ vma_assert_write_locked(src_vma); raw_write_seqcount_begin(&src_mm->write_protect_seq); } ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd, addr, next))) { ret = -ENOMEM; break; } } while (dst_pgd++, src_pgd++, addr = next, addr != end); if (is_cow) { raw_write_seqcount_end(&src_mm->write_protect_seq); mmu_notifier_invalidate_range_end(&range); } return ret; } /* Whether we should zap all COWed (private) pages too */ static inline bool should_zap_cows(struct zap_details *details) { /* By default, zap all pages */ if (!details || details->reclaim_pt) return true; /* Or, we zap COWed pages only if the caller wants to */ return details->even_cows; } /* Decides whether we should zap this folio with the folio pointer specified */ static inline bool should_zap_folio(struct zap_details *details, struct folio *folio) { /* If we can make a decision without *folio.. */ if (should_zap_cows(details)) return true; /* Otherwise we should only zap non-anon folios */ return !folio_test_anon(folio); } static inline bool zap_drop_markers(struct zap_details *details) { if (!details) return false; return details->zap_flags & ZAP_FLAG_DROP_MARKER; } /* * This function makes sure that we'll replace the none pte with an uffd-wp * swap special pte marker when necessary. Must be with the pgtable lock held. * * Returns true if uffd-wp ptes was installed, false otherwise. */ static inline bool zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, pte_t *pte, int nr, struct zap_details *details, pte_t pteval) { bool was_installed = false; #ifdef CONFIG_PTE_MARKER_UFFD_WP /* Zap on anonymous always means dropping everything */ if (vma_is_anonymous(vma)) return false; if (zap_drop_markers(details)) return false; for (;;) { /* the PFN in the PTE is irrelevant. */ if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval)) was_installed = true; if (--nr == 0) break; pte++; addr += PAGE_SIZE; } #endif return was_installed; } static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, struct folio *folio, struct page *page, pte_t *pte, pte_t ptent, unsigned int nr, unsigned long addr, struct zap_details *details, int *rss, bool *force_flush, bool *force_break, bool *any_skipped) { struct mm_struct *mm = tlb->mm; bool delay_rmap = false; if (!folio_test_anon(folio)) { ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm); if (pte_dirty(ptent)) { folio_mark_dirty(folio); if (tlb_delay_rmap(tlb)) { delay_rmap = true; *force_flush = true; } } if (pte_young(ptent) && likely(vma_has_recency(vma))) folio_mark_accessed(folio); rss[mm_counter(folio)] -= nr; } else { /* We don't need up-to-date accessed/dirty bits. */ clear_full_ptes(mm, addr, pte, nr, tlb->fullmm); rss[MM_ANONPAGES] -= nr; } /* Checking a single PTE in a batch is sufficient. */ arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entries(tlb, pte, nr, addr); if (unlikely(userfaultfd_pte_wp(vma, ptent))) *any_skipped = zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); if (!delay_rmap) { folio_remove_rmap_ptes(folio, page, nr, vma); if (unlikely(folio_mapcount(folio) < 0)) print_bad_pte(vma, addr, ptent, page); } if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) { *force_flush = true; *force_break = true; } } /* * Zap or skip at least one present PTE, trying to batch-process subsequent * PTEs that map consecutive pages of the same folio. * * Returns the number of processed (skipped or zapped) PTEs (at least 1). */ static inline int zap_present_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, pte_t *pte, pte_t ptent, unsigned int max_nr, unsigned long addr, struct zap_details *details, int *rss, bool *force_flush, bool *force_break, bool *any_skipped) { struct mm_struct *mm = tlb->mm; struct folio *folio; struct page *page; int nr; page = vm_normal_page(vma, addr, ptent); if (!page) { /* We don't need up-to-date accessed/dirty bits. */ ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entry(tlb, pte, addr); if (userfaultfd_pte_wp(vma, ptent)) *any_skipped = zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); ksm_might_unmap_zero_page(mm, ptent); return 1; } folio = page_folio(page); if (unlikely(!should_zap_folio(details, folio))) { *any_skipped = true; return 1; } /* * Make sure that the common "small folio" case is as fast as possible * by keeping the batching logic separate. */ if (unlikely(folio_test_large(folio) && max_nr != 1)) { nr = folio_pte_batch(folio, pte, ptent, max_nr); zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr, addr, details, rss, force_flush, force_break, any_skipped); return nr; } zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr, details, rss, force_flush, force_break, any_skipped); return 1; } static inline int zap_nonpresent_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, pte_t *pte, pte_t ptent, unsigned int max_nr, unsigned long addr, struct zap_details *details, int *rss, bool *any_skipped) { swp_entry_t entry; int nr = 1; *any_skipped = true; entry = pte_to_swp_entry(ptent); if (is_device_private_entry(entry) || is_device_exclusive_entry(entry)) { struct page *page = pfn_swap_entry_to_page(entry); struct folio *folio = page_folio(page); if (unlikely(!should_zap_folio(details, folio))) return 1; /* * Both device private/exclusive mappings should only * work with anonymous page so far, so we don't need to * consider uffd-wp bit when zap. For more information, * see zap_install_uffd_wp_if_needed(). */ WARN_ON_ONCE(!vma_is_anonymous(vma)); rss[mm_counter(folio)]--; folio_remove_rmap_pte(folio, page, vma); folio_put(folio); } else if (!non_swap_entry(entry)) { /* Genuine swap entries, hence a private anon pages */ if (!should_zap_cows(details)) return 1; nr = swap_pte_batch(pte, max_nr, ptent); rss[MM_SWAPENTS] -= nr; free_swap_and_cache_nr(entry, nr); } else if (is_migration_entry(entry)) { struct folio *folio = pfn_swap_entry_folio(entry); if (!should_zap_folio(details, folio)) return 1; rss[mm_counter(folio)]--; } else if (pte_marker_entry_uffd_wp(entry)) { /* * For anon: always drop the marker; for file: only * drop the marker if explicitly requested. */ if (!vma_is_anonymous(vma) && !zap_drop_markers(details)) return 1; } else if (is_guard_swp_entry(entry)) { /* * Ordinary zapping should not remove guard PTE * markers. Only do so if we should remove PTE markers * in general. */ if (!zap_drop_markers(details)) return 1; } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { if (!should_zap_cows(details)) return 1; } else { /* We should have covered all the swap entry types */ pr_alert("unrecognized swap entry 0x%lx\n", entry.val); WARN_ON_ONCE(1); } clear_not_present_full_ptes(vma->vm_mm, addr, pte, nr, tlb->fullmm); *any_skipped = zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); return nr; } static inline int do_zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pte_t *pte, unsigned long addr, unsigned long end, struct zap_details *details, int *rss, bool *force_flush, bool *force_break, bool *any_skipped) { pte_t ptent = ptep_get(pte); int max_nr = (end - addr) / PAGE_SIZE; int nr = 0; /* Skip all consecutive none ptes */ if (pte_none(ptent)) { for (nr = 1; nr < max_nr; nr++) { ptent = ptep_get(pte + nr); if (!pte_none(ptent)) break; } max_nr -= nr; if (!max_nr) return nr; pte += nr; addr += nr * PAGE_SIZE; } if (pte_present(ptent)) nr += zap_present_ptes(tlb, vma, pte, ptent, max_nr, addr, details, rss, force_flush, force_break, any_skipped); else nr += zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr, details, rss, any_skipped); return nr; } static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, struct zap_details *details) { bool force_flush = false, force_break = false; struct mm_struct *mm = tlb->mm; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; pte_t *pte; pmd_t pmdval; unsigned long start = addr; bool can_reclaim_pt = reclaim_pt_is_enabled(start, end, details); bool direct_reclaim = true; int nr; retry: tlb_change_page_size(tlb, PAGE_SIZE); init_rss_vec(rss); start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!pte) return addr; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { bool any_skipped = false; if (need_resched()) { direct_reclaim = false; break; } nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss, &force_flush, &force_break, &any_skipped); if (any_skipped) can_reclaim_pt = false; if (unlikely(force_break)) { addr += nr * PAGE_SIZE; direct_reclaim = false; break; } } while (pte += nr, addr += PAGE_SIZE * nr, addr != end); /* * Fast path: try to hold the pmd lock and unmap the PTE page. * * If the pte lock was released midway (retry case), or if the attempt * to hold the pmd lock failed, then we need to recheck all pte entries * to ensure they are still none, thereby preventing the pte entries * from being repopulated by another thread. */ if (can_reclaim_pt && direct_reclaim && addr == end) direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval); add_mm_rss_vec(mm, rss); arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ if (force_flush) { tlb_flush_mmu_tlbonly(tlb); tlb_flush_rmaps(tlb, vma); } pte_unmap_unlock(start_pte, ptl); /* * If we forced a TLB flush (either due to running out of * batch buffers or because we needed to flush dirty TLB * entries before releasing the ptl), free the batched * memory too. Come back again if we didn't do everything. */ if (force_flush) tlb_flush_mmu(tlb); if (addr != end) { cond_resched(); force_flush = false; force_break = false; goto retry; } if (can_reclaim_pt) { if (direct_reclaim) free_pte(mm, start, tlb, pmdval); else try_to_free_pte(mm, pmd, start, tlb); } return addr; } static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, struct zap_details *details) { pmd_t *pmd; unsigned long next; pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)) { if (next - addr != HPAGE_PMD_SIZE) __split_huge_pmd(vma, pmd, addr, false); else if (zap_huge_pmd(tlb, vma, pmd, addr)) { addr = next; continue; } /* fall through */ } else if (details && details->single_folio && folio_test_pmd_mappable(details->single_folio) && next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) { spinlock_t *ptl = pmd_lock(tlb->mm, pmd); /* * Take and drop THP pmd lock so that we cannot return * prematurely, while zap_huge_pmd() has cleared *pmd, * but not yet decremented compound_mapcount(). */ spin_unlock(ptl); } if (pmd_none(*pmd)) { addr = next; continue; } addr = zap_pte_range(tlb, vma, pmd, addr, next, details); if (addr != next) pmd--; } while (pmd++, cond_resched(), addr != end); return addr; } static inline unsigned long zap_pud_range(struct mmu_gather *tlb, struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr, unsigned long end, struct zap_details *details) { pud_t *pud; unsigned long next; pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); if (pud_trans_huge(*pud)) { if (next - addr != HPAGE_PUD_SIZE) { mmap_assert_locked(tlb->mm); split_huge_pud(vma, pud, addr); } else if (zap_huge_pud(tlb, vma, pud, addr)) goto next; /* fall through */ } if (pud_none_or_clear_bad(pud)) continue; next = zap_pmd_range(tlb, vma, pud, addr, next, details); next: cond_resched(); } while (pud++, addr = next, addr != end); return addr; } static inline unsigned long zap_p4d_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr, unsigned long end, struct zap_details *details) { p4d_t *p4d; unsigned long next; p4d = p4d_offset(pgd, addr); do { next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(p4d)) continue; next = zap_pud_range(tlb, vma, p4d, addr, next, details); } while (p4d++, addr = next, addr != end); return addr; } void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long end, struct zap_details *details) { pgd_t *pgd; unsigned long next; BUG_ON(addr >= end); tlb_start_vma(tlb, vma); pgd = pgd_offset(vma->vm_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) continue; next = zap_p4d_range(tlb, vma, pgd, addr, next, details); } while (pgd++, addr = next, addr != end); tlb_end_vma(tlb, vma); } static void unmap_single_vma(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, struct zap_details *details, bool mm_wr_locked) { unsigned long start = max(vma->vm_start, start_addr); unsigned long end; if (start >= vma->vm_end) return; end = min(vma->vm_end, end_addr); if (end <= vma->vm_start) return; if (vma->vm_file) uprobe_munmap(vma, start, end); if (start != end) { if (unlikely(is_vm_hugetlb_page(vma))) { /* * It is undesirable to test vma->vm_file as it * should be non-null for valid hugetlb area. * However, vm_file will be NULL in the error * cleanup path of mmap_region. When * hugetlbfs ->mmap method fails, * mmap_region() nullifies vma->vm_file * before calling this function to clean up. * Since no pte has actually been setup, it is * safe to do nothing in this case. */ if (vma->vm_file) { zap_flags_t zap_flags = details ? details->zap_flags : 0; __unmap_hugepage_range(tlb, vma, start, end, NULL, zap_flags); } } else unmap_page_range(tlb, vma, start, end, details); } } /** * unmap_vmas - unmap a range of memory covered by a list of vma's * @tlb: address of the caller's struct mmu_gather * @mas: the maple state * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping * @tree_end: The maximum index to check * @mm_wr_locked: lock flag * * Unmap all pages in the vma list. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. * * unmap_vmas() assumes that the caller will flush the whole unmapped address * range after unmap_vmas() returns. So the only responsibility here is to * ensure that any thus-far unmapped pages are flushed before unmap_vmas() * drops the lock and schedules. */ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long tree_end, bool mm_wr_locked) { struct mmu_notifier_range range; struct zap_details details = { .zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP, /* Careful - we need to zap private pages too! */ .even_cows = true, }; mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm, start_addr, end_addr); mmu_notifier_invalidate_range_start(&range); do { unsigned long start = start_addr; unsigned long end = end_addr; hugetlb_zap_begin(vma, &start, &end); unmap_single_vma(tlb, vma, start, end, &details, mm_wr_locked); hugetlb_zap_end(vma, &details); vma = mas_find(mas, tree_end - 1); } while (vma && likely(!xa_is_zero(vma))); mmu_notifier_invalidate_range_end(&range); } /** * zap_page_range_single_batched - remove user pages in a given range * @tlb: pointer to the caller's struct mmu_gather * @vma: vm_area_struct holding the applicable pages * @address: starting address of pages to remove * @size: number of bytes to remove * @details: details of shared cache invalidation * * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for * hugetlb, @tlb is flushed and re-initialized by this function. */ void zap_page_range_single_batched(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { const unsigned long end = address + size; struct mmu_notifier_range range; VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, address, end); hugetlb_zap_begin(vma, &range.start, &range.end); update_hiwater_rss(vma->vm_mm); mmu_notifier_invalidate_range_start(&range); /* * unmap 'address-end' not 'range.start-range.end' as range * could have been expanded for hugetlb pmd sharing. */ unmap_single_vma(tlb, vma, address, end, details, false); mmu_notifier_invalidate_range_end(&range); if (is_vm_hugetlb_page(vma)) { /* * flush tlb and free resources before hugetlb_zap_end(), to * avoid concurrent page faults' allocation failure. */ tlb_finish_mmu(tlb); hugetlb_zap_end(vma, details); tlb_gather_mmu(tlb, vma->vm_mm); } } /** * zap_page_range_single - remove user pages in a given range * @vma: vm_area_struct holding the applicable pages * @address: starting address of pages to zap * @size: number of bytes to zap * @details: details of shared cache invalidation * * The range must fit into one VMA. */ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { struct mmu_gather tlb; tlb_gather_mmu(&tlb, vma->vm_mm); zap_page_range_single_batched(&tlb, vma, address, size, details); tlb_finish_mmu(&tlb); } /** * zap_vma_ptes - remove ptes mapping the vma * @vma: vm_area_struct holding ptes to be zapped * @address: starting address of pages to zap * @size: number of bytes to zap * * This function only unmaps ptes assigned to VM_PFNMAP vmas. * * The entire address range must be fully contained within the vma. * */ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size) { if (!range_in_vma(vma, address, address + size) || !(vma->vm_flags & VM_PFNMAP)) return; zap_page_range_single(vma, address, size, NULL); } EXPORT_SYMBOL_GPL(zap_vma_ptes); static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long addr) { pgd_t *pgd; p4d_t *p4d; pud_t *pud; pmd_t *pmd; pgd = pgd_offset(mm, addr); p4d = p4d_alloc(mm, pgd, addr); if (!p4d) return NULL; pud = pud_alloc(mm, p4d, addr); if (!pud) return NULL; pmd = pmd_alloc(mm, pud, addr); if (!pmd) return NULL; VM_BUG_ON(pmd_trans_huge(*pmd)); return pmd; } pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, spinlock_t **ptl) { pmd_t *pmd = walk_to_pmd(mm, addr); if (!pmd) return NULL; return pte_alloc_map_lock(mm, pmd, addr, ptl); } static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma) { VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP); /* * Whoever wants to forbid the zeropage after some zeropages * might already have been mapped has to scan the page tables and * bail out on any zeropages. Zeropages in COW mappings can * be unshared using FAULT_FLAG_UNSHARE faults. */ if (mm_forbids_zeropage(vma->vm_mm)) return false; /* zeropages in COW mappings are common and unproblematic. */ if (is_cow_mapping(vma->vm_flags)) return true; /* Mappings that do not allow for writable PTEs are unproblematic. */ if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) return true; /* * Why not allow any VMA that has vm_ops->pfn_mkwrite? GUP could * find the shared zeropage and longterm-pin it, which would * be problematic as soon as the zeropage gets replaced by a different * page due to vma->vm_ops->pfn_mkwrite, because what's mapped would * now differ to what GUP looked up. FSDAX is incompatible to * FOLL_LONGTERM and VM_IO is incompatible to GUP completely (see * check_vma_flags). */ return vma->vm_ops && vma->vm_ops->pfn_mkwrite && (vma_is_fsdax(vma) || vma->vm_flags & VM_IO); } static int validate_page_before_insert(struct vm_area_struct *vma, struct page *page) { struct folio *folio = page_folio(page); if (!folio_ref_count(folio)) return -EINVAL; if (unlikely(is_zero_folio(folio))) { if (!vm_mixed_zeropage_allowed(vma)) return -EINVAL; return 0; } if (folio_test_anon(folio) || page_has_type(page)) return -EINVAL; flush_dcache_folio(folio); return 0; } static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot, bool mkwrite) { struct folio *folio = page_folio(page); pte_t pteval = ptep_get(pte); if (!pte_none(pteval)) { if (!mkwrite) return -EBUSY; /* see insert_pfn(). */ if (pte_pfn(pteval) != page_to_pfn(page)) { WARN_ON_ONCE(!is_zero_pfn(pte_pfn(pteval))); return -EFAULT; } pteval = maybe_mkwrite(pteval, vma); pteval = pte_mkyoung(pteval); if (ptep_set_access_flags(vma, addr, pte, pteval, 1)) update_mmu_cache(vma, addr, pte); return 0; } /* Ok, finally just insert the thing.. */ pteval = mk_pte(page, prot); if (unlikely(is_zero_folio(folio))) { pteval = pte_mkspecial(pteval); } else { folio_get(folio); pteval = mk_pte(page, prot); if (mkwrite) { pteval = pte_mkyoung(pteval); pteval = maybe_mkwrite(pte_mkdirty(pteval), vma); } inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); folio_add_file_rmap_pte(folio, page, vma); } set_pte_at(vma->vm_mm, addr, pte, pteval); return 0; } static int insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot, bool mkwrite) { int retval; pte_t *pte; spinlock_t *ptl; retval = validate_page_before_insert(vma, page); if (retval) goto out; retval = -ENOMEM; pte = get_locked_pte(vma->vm_mm, addr, &ptl); if (!pte) goto out; retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, mkwrite); pte_unmap_unlock(pte, ptl); out: return retval; } static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { int err; err = validate_page_before_insert(vma, page); if (err) return err; return insert_page_into_pte_locked(vma, pte, addr, page, prot, false); } /* insert_pages() amortizes the cost of spinlock operations * when inserting pages in a loop. */ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, struct page **pages, unsigned long *num, pgprot_t prot) { pmd_t *pmd = NULL; pte_t *start_pte, *pte; spinlock_t *pte_lock; struct mm_struct *const mm = vma->vm_mm; unsigned long curr_page_idx = 0; unsigned long remaining_pages_total = *num; unsigned long pages_to_write_in_pmd; int ret; more: ret = -EFAULT; pmd = walk_to_pmd(mm, addr); if (!pmd) goto out; pages_to_write_in_pmd = min_t(unsigned long, remaining_pages_total, PTRS_PER_PTE - pte_index(addr)); /* Allocate the PTE if necessary; takes PMD lock once only. */ ret = -ENOMEM; if (pte_alloc(mm, pmd)) goto out; while (pages_to_write_in_pmd) { int pte_idx = 0; const int batch_size = min_t(int, pages_to_write_in_pmd, 8); start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock); if (!start_pte) { ret = -EFAULT; goto out; } for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) { int err = insert_page_in_batch_locked(vma, pte, addr, pages[curr_page_idx], prot); if (unlikely(err)) { pte_unmap_unlock(start_pte, pte_lock); ret = err; remaining_pages_total -= pte_idx; goto out; } addr += PAGE_SIZE; ++curr_page_idx; } pte_unmap_unlock(start_pte, pte_lock); pages_to_write_in_pmd -= batch_size; remaining_pages_total -= batch_size; } if (remaining_pages_total) goto more; ret = 0; out: *num = remaining_pages_total; return ret; } /** * vm_insert_pages - insert multiple pages into user vma, batching the pmd lock. * @vma: user vma to map to * @addr: target start user address of these pages * @pages: source kernel pages * @num: in: number of pages to map. out: number of pages that were *not* * mapped. (0 means all pages were successfully mapped). * * Preferred over vm_insert_page() when inserting multiple pages. * * In case of error, we may have mapped a subset of the provided * pages. It is the caller's responsibility to account for this case. * * The same restrictions apply as in vm_insert_page(). */ int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, struct page **pages, unsigned long *num) { const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1; if (addr < vma->vm_start || end_addr >= vma->vm_end) return -EFAULT; if (!(vma->vm_flags & VM_MIXEDMAP)) { BUG_ON(mmap_read_trylock(vma->vm_mm)); BUG_ON(vma->vm_flags & VM_PFNMAP); vm_flags_set(vma, VM_MIXEDMAP); } /* Defer page refcount checking till we're about to map that page. */ return insert_pages(vma, addr, pages, num, vma->vm_page_prot); } EXPORT_SYMBOL(vm_insert_pages); /** * vm_insert_page - insert single page into user vma * @vma: user vma to map to * @addr: target user address of this page * @page: source kernel page * * This allows drivers to insert individual pages they've allocated * into a user vma. The zeropage is supported in some VMAs, * see vm_mixed_zeropage_allowed(). * * The page has to be a nice clean _individual_ kernel allocation. * If you allocate a compound page, you need to have marked it as * such (__GFP_COMP), or manually just split the page up yourself * (see split_page()). * * NOTE! Traditionally this was done with "remap_pfn_range()" which * took an arbitrary page protection parameter. This doesn't allow * that. Your vma protection will have to be set up correctly, which * means that if you want a shared writable mapping, you'd better * ask for a shared writable mapping! * * The page does not need to be reserved. * * Usually this function is called from f_op->mmap() handler * under mm->mmap_lock write-lock, so it can change vma->vm_flags. * Caller must set VM_MIXEDMAP on vma if it wants to call this * function from other places, for example from page-fault handler. * * Return: %0 on success, negative error code otherwise. */ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page) { if (addr < vma->vm_start || addr >= vma->vm_end) return -EFAULT; if (!(vma->vm_flags & VM_MIXEDMAP)) { BUG_ON(mmap_read_trylock(vma->vm_mm)); BUG_ON(vma->vm_flags & VM_PFNMAP); vm_flags_set(vma, VM_MIXEDMAP); } return insert_page(vma, addr, page, vma->vm_page_prot, false); } EXPORT_SYMBOL(vm_insert_page); /* * __vm_map_pages - maps range of kernel pages into user vma * @vma: user vma to map to * @pages: pointer to array of source kernel pages * @num: number of pages in page array * @offset: user's requested vm_pgoff * * This allows drivers to map range of kernel pages into a user vma. * The zeropage is supported in some VMAs, see * vm_mixed_zeropage_allowed(). * * Return: 0 on success and error code otherwise. */ static int __vm_map_pages(struct vm_area_struct *vma, struct page **pages, unsigned long num, unsigned long offset) { unsigned long count = vma_pages(vma); unsigned long uaddr = vma->vm_start; int ret, i; /* Fail if the user requested offset is beyond the end of the object */ if (offset >= num) return -ENXIO; /* Fail if the user requested size exceeds available object size */ if (count > num - offset) return -ENXIO; for (i = 0; i < count; i++) { ret = vm_insert_page(vma, uaddr, pages[offset + i]); if (ret < 0) return ret; uaddr += PAGE_SIZE; } return 0; } /** * vm_map_pages - maps range of kernel pages starts with non zero offset * @vma: user vma to map to * @pages: pointer to array of source kernel pages * @num: number of pages in page array * * Maps an object consisting of @num pages, catering for the user's * requested vm_pgoff * * If we fail to insert any page into the vma, the function will return * immediately leaving any previously inserted pages present. Callers * from the mmap handler may immediately return the error as their caller * will destroy the vma, removing any successfully inserted pages. Other * callers should make their own arrangements for calling unmap_region(). * * Context: Process context. Called by mmap handlers. * Return: 0 on success and error code otherwise. */ int vm_map_pages(struct vm_area_struct *vma, struct page **pages, unsigned long num) { return __vm_map_pages(vma, pages, num, vma->vm_pgoff); } EXPORT_SYMBOL(vm_map_pages); /** * vm_map_pages_zero - map range of kernel pages starts with zero offset * @vma: user vma to map to * @pages: pointer to array of source kernel pages * @num: number of pages in page array * * Similar to vm_map_pages(), except that it explicitly sets the offset * to 0. This function is intended for the drivers that did not consider * vm_pgoff. * * Context: Process context. Called by mmap handlers. * Return: 0 on success and error code otherwise. */ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages, unsigned long num) { return __vm_map_pages(vma, pages, num, 0); } EXPORT_SYMBOL(vm_map_pages_zero); static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t prot, bool mkwrite) { struct mm_struct *mm = vma->vm_mm; pte_t *pte, entry; spinlock_t *ptl; pte = get_locked_pte(mm, addr, &ptl); if (!pte) return VM_FAULT_OOM; entry = ptep_get(pte); if (!pte_none(entry)) { if (mkwrite) { /* * For read faults on private mappings the PFN passed * in may not match the PFN we have mapped if the * mapped PFN is a writeable COW page. In the mkwrite * case we are creating a writable PTE for a shared * mapping and we expect the PFNs to match. If they * don't match, we are likely racing with block * allocation and mapping invalidation so just skip the * update. */ if (pte_pfn(entry) != pfn) { WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); goto out_unlock; } entry = pte_mkyoung(entry); entry = maybe_mkwrite(pte_mkdirty(entry), vma); if (ptep_set_access_flags(vma, addr, pte, entry, 1)) update_mmu_cache(vma, addr, pte); } goto out_unlock; } /* Ok, finally just insert the thing.. */ entry = pte_mkspecial(pfn_pte(pfn, prot)); if (mkwrite) { entry = pte_mkyoung(entry); entry = maybe_mkwrite(pte_mkdirty(entry), vma); } set_pte_at(mm, addr, pte, entry); update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */ out_unlock: pte_unmap_unlock(pte, ptl); return VM_FAULT_NOPAGE; } /** * vmf_insert_pfn_prot - insert single pfn into user vma with specified pgprot * @vma: user vma to map to * @addr: target user address of this page * @pfn: source kernel pfn * @pgprot: pgprot flags for the inserted page * * This is exactly like vmf_insert_pfn(), except that it allows drivers * to override pgprot on a per-page basis. * * This only makes sense for IO mappings, and it makes no sense for * COW mappings. In general, using multiple vmas is preferable; * vmf_insert_pfn_prot should only be used if using multiple VMAs is * impractical. * * pgprot typically only differs from @vma->vm_page_prot when drivers set * caching- and encryption bits different than those of @vma->vm_page_prot, * because the caching- or encryption mode may not be known at mmap() time. * * This is ok as long as @vma->vm_page_prot is not used by the core vm * to set caching and encryption bits for those vmas (except for COW pages). * This is ensured by core vm only modifying these page table entries using * functions that don't touch caching- or encryption bits, using pte_modify() * if needed. (See for example mprotect()). * * Also when new page-table entries are created, this is only done using the * fault() callback, and never using the value of vma->vm_page_prot, * except for page-table entries that point to anonymous pages as the result * of COW. * * Context: Process context. May allocate using %GFP_KERNEL. * Return: vm_fault_t value. */ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t pgprot) { /* * Technically, architectures with pte_special can avoid all these * restrictions (same for remap_pfn_range). However we would like * consistency in testing and feature parity among all, so we should * try to keep these invariants in place for everybody. */ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn)); if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; if (!pfn_modify_allowed(pfn, pgprot)) return VM_FAULT_SIGBUS; pfnmap_setup_cachemode_pfn(pfn, &pgprot); return insert_pfn(vma, addr, pfn, pgprot, false); } EXPORT_SYMBOL(vmf_insert_pfn_prot); /** * vmf_insert_pfn - insert single pfn into user vma * @vma: user vma to map to * @addr: target user address of this page * @pfn: source kernel pfn * * Similar to vm_insert_page, this allows drivers to insert individual pages * they've allocated into a user vma. Same comments apply. * * This function should only be called from a vm_ops->fault handler, and * in that case the handler should return the result of this function. * * vma cannot be a COW mapping. * * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * * Context: Process context. May allocate using %GFP_KERNEL. * Return: vm_fault_t value. */ vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot); } EXPORT_SYMBOL(vmf_insert_pfn); static bool vm_mixed_ok(struct vm_area_struct *vma, unsigned long pfn, bool mkwrite) { if (unlikely(is_zero_pfn(pfn)) && (mkwrite || !vm_mixed_zeropage_allowed(vma))) return false; /* these checks mirror the abort conditions in vm_normal_page */ if (vma->vm_flags & VM_MIXEDMAP) return true; if (is_zero_pfn(pfn)) return true; return false; } static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, bool mkwrite) { pgprot_t pgprot = vma->vm_page_prot; int err; if (!vm_mixed_ok(vma, pfn, mkwrite)) return VM_FAULT_SIGBUS; if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; pfnmap_setup_cachemode_pfn(pfn, &pgprot); if (!pfn_modify_allowed(pfn, pgprot)) return VM_FAULT_SIGBUS; /* * If we don't have pte special, then we have to use the pfn_valid() * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must* * refcount the page if pfn_valid is true (hence insert_page rather * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP * without pte special, it would there be refcounted as a normal page. */ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pfn_valid(pfn)) { struct page *page; /* * At this point we are committed to insert_page() * regardless of whether the caller specified flags that * result in pfn_t_has_page() == false. */ page = pfn_to_page(pfn); err = insert_page(vma, addr, page, pgprot, mkwrite); } else { return insert_pfn(vma, addr, pfn, pgprot, mkwrite); } if (err == -ENOMEM) return VM_FAULT_OOM; if (err < 0 && err != -EBUSY) return VM_FAULT_SIGBUS; return VM_FAULT_NOPAGE; } vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page, bool write) { pgprot_t pgprot = vmf->vma->vm_page_prot; unsigned long addr = vmf->address; int err; if (addr < vmf->vma->vm_start || addr >= vmf->vma->vm_end) return VM_FAULT_SIGBUS; err = insert_page(vmf->vma, addr, page, pgprot, write); if (err == -ENOMEM) return VM_FAULT_OOM; if (err < 0 && err != -EBUSY) return VM_FAULT_SIGBUS; return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite); vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { return __vm_insert_mixed(vma, addr, pfn, false); } EXPORT_SYMBOL(vmf_insert_mixed); /* * If the insertion of PTE failed because someone else already added a * different entry in the mean time, we treat that as success as we assume * the same entry was actually inserted. */ vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) { return __vm_insert_mixed(vma, addr, pfn, true); } /* * maps a range of physical memory into the requested pages. the old * mappings are removed. any references to nonexistent pages results * in null mappings (currently treated as "copy-on-access") */ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, unsigned long end, unsigned long pfn, pgprot_t prot) { pte_t *pte, *mapped_pte; spinlock_t *ptl; int err = 0; mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); if (!pte) return -ENOMEM; arch_enter_lazy_mmu_mode(); do { BUG_ON(!pte_none(ptep_get(pte))); if (!pfn_modify_allowed(pfn, prot)) { err = -EACCES; break; } set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(mapped_pte, ptl); return err; } static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, unsigned long addr, unsigned long end, unsigned long pfn, pgprot_t prot) { pmd_t *pmd; unsigned long next; int err; pfn -= addr >> PAGE_SHIFT; pmd = pmd_alloc(mm, pud, addr); if (!pmd) return -ENOMEM; VM_BUG_ON(pmd_trans_huge(*pmd)); do { next = pmd_addr_end(addr, end); err = remap_pte_range(mm, pmd, addr, next, pfn + (addr >> PAGE_SHIFT), prot); if (err) return err; } while (pmd++, addr = next, addr != end); return 0; } static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d, unsigned long addr, unsigned long end, unsigned long pfn, pgprot_t prot) { pud_t *pud; unsigned long next; int err; pfn -= addr >> PAGE_SHIFT; pud = pud_alloc(mm, p4d, addr); if (!pud) return -ENOMEM; do { next = pud_addr_end(addr, end); err = remap_pmd_range(mm, pud, addr, next, pfn + (addr >> PAGE_SHIFT), prot); if (err) return err; } while (pud++, addr = next, addr != end); return 0; } static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd, unsigned long addr, unsigned long end, unsigned long pfn, pgprot_t prot) { p4d_t *p4d; unsigned long next; int err; pfn -= addr >> PAGE_SHIFT; p4d = p4d_alloc(mm, pgd, addr); if (!p4d) return -ENOMEM; do { next = p4d_addr_end(addr, end); err = remap_pud_range(mm, p4d, addr, next, pfn + (addr >> PAGE_SHIFT), prot); if (err) return err; } while (p4d++, addr = next, addr != end); return 0; } static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot) { pgd_t *pgd; unsigned long next; unsigned long end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; int err; if (WARN_ON_ONCE(!PAGE_ALIGNED(addr))) return -EINVAL; /* * Physically remapped pages are special. Tell the * rest of the world about it: * VM_IO tells people not to look at these pages * (accesses can have side effects). * VM_PFNMAP tells the core MM that the base pages are just * raw PFN mappings, and do not have a "struct page" associated * with them. * VM_DONTEXPAND * Disable vma merging and expanding with mremap(). * VM_DONTDUMP * Omit vma from core dump, even when VM_IO turned off. * * There's a horrible special case to handle copy-on-write * behaviour that some programs depend on. We mark the "original" * un-COW'ed pages by matching them up with "vma->vm_pgoff". * See vm_normal_page() for details. */ if (is_cow_mapping(vma->vm_flags)) { if (addr != vma->vm_start || end != vma->vm_end) return -EINVAL; vma->vm_pgoff = pfn; } vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); BUG_ON(addr >= end); pfn -= addr >> PAGE_SHIFT; pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); do { next = pgd_addr_end(addr, end); err = remap_p4d_range(mm, pgd, addr, next, pfn + (addr >> PAGE_SHIFT), prot); if (err) return err; } while (pgd++, addr = next, addr != end); return 0; } /* * Variant of remap_pfn_range that does not call track_pfn_remap. The caller * must have pre-validated the caching bits of the pgprot_t. */ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot) { int error = remap_pfn_range_internal(vma, addr, pfn, size, prot); if (!error) return 0; /* * A partial pfn range mapping is dangerous: it does not * maintain page reference counts, and callers may free * pages due to the error. So zap it early. */ zap_page_range_single(vma, addr, size, NULL); return error; } #ifdef __HAVE_PFNMAP_TRACKING static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn, unsigned long size, pgprot_t *prot) { struct pfnmap_track_ctx *ctx; if (pfnmap_track(pfn, size, prot)) return ERR_PTR(-EINVAL); ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); if (unlikely(!ctx)) { pfnmap_untrack(pfn, size); return ERR_PTR(-ENOMEM); } ctx->pfn = pfn; ctx->size = size; kref_init(&ctx->kref); return ctx; } void pfnmap_track_ctx_release(struct kref *ref) { struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref); pfnmap_untrack(ctx->pfn, ctx->size); kfree(ctx); } #endif /* __HAVE_PFNMAP_TRACKING */ /** * remap_pfn_range - remap kernel memory to userspace * @vma: user vma to map to * @addr: target page aligned user address to start at * @pfn: page frame number of kernel physical memory address * @size: size of mapping area * @prot: page protection flags for this mapping * * Note: this is only safe if the mm semaphore is held when called. * * Return: %0 on success, negative error code otherwise. */ #ifdef __HAVE_PFNMAP_TRACKING int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot) { struct pfnmap_track_ctx *ctx = NULL; int err; size = PAGE_ALIGN(size); /* * If we cover the full VMA, we'll perform actual tracking, and * remember to untrack when the last reference to our tracking * context from a VMA goes away. We'll keep tracking the whole pfn * range even during VMA splits and partial unmapping. * * If we only cover parts of the VMA, we'll only setup the cachemode * in the pgprot for the pfn range. */ if (addr == vma->vm_start && addr + size == vma->vm_end) { if (vma->pfnmap_track_ctx) return -EINVAL; ctx = pfnmap_track_ctx_alloc(pfn, size, &prot); if (IS_ERR(ctx)) return PTR_ERR(ctx); } else if (pfnmap_setup_cachemode(pfn, size, &prot)) { return -EINVAL; } err = remap_pfn_range_notrack(vma, addr, pfn, size, prot); if (ctx) { if (err) kref_put(&ctx->kref, pfnmap_track_ctx_release); else vma->pfnmap_track_ctx = ctx; } return err; } #else int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot) { return remap_pfn_range_notrack(vma, addr, pfn, size, prot); } #endif EXPORT_SYMBOL(remap_pfn_range); /** * vm_iomap_memory - remap memory to userspace * @vma: user vma to map to * @start: start of the physical memory to be mapped * @len: size of area * * This is a simplified io_remap_pfn_range() for common driver use. The * driver just needs to give us the physical memory range to be mapped, * we'll figure out the rest from the vma information. * * NOTE! Some drivers might want to tweak vma->vm_page_prot first to get * whatever write-combining details or similar. * * Return: %0 on success, negative error code otherwise. */ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len) { unsigned long vm_len, pfn, pages; /* Check that the physical memory area passed in looks valid */ if (start + len < start) return -EINVAL; /* * You *really* shouldn't map things that aren't page-aligned, * but we've historically allowed it because IO memory might * just have smaller alignment. */ len += start & ~PAGE_MASK; pfn = start >> PAGE_SHIFT; pages = (len + ~PAGE_MASK) >> PAGE_SHIFT; if (pfn + pages < pfn) return -EINVAL; /* We start the mapping 'vm_pgoff' pages into the area */ if (vma->vm_pgoff > pages) return -EINVAL; pfn += vma->vm_pgoff; pages -= vma->vm_pgoff; /* Can we fit all of the mapping? */ vm_len = vma->vm_end - vma->vm_start; if (vm_len >> PAGE_SHIFT > pages) return -EINVAL; /* Ok, let it rip */ return io_remap_pfn_range(vma, vma->vm_start, pfn, vm_len, vma->vm_page_prot); } EXPORT_SYMBOL(vm_iomap_memory); static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, unsigned long end, pte_fn_t fn, void *data, bool create, pgtbl_mod_mask *mask) { pte_t *pte, *mapped_pte; int err = 0; spinlock_t *ptl; if (create) { mapped_pte = pte = (mm == &init_mm) ? pte_alloc_kernel_track(pmd, addr, mask) : pte_alloc_map_lock(mm, pmd, addr, &ptl); if (!pte) return -ENOMEM; } else { mapped_pte = pte = (mm == &init_mm) ? pte_offset_kernel(pmd, addr) : pte_offset_map_lock(mm, pmd, addr, &ptl); if (!pte) return -EINVAL; } arch_enter_lazy_mmu_mode(); if (fn) { do { if (create || !pte_none(ptep_get(pte))) { err = fn(pte, addr, data); if (err) break; } } while (pte++, addr += PAGE_SIZE, addr != end); } *mask |= PGTBL_PTE_MODIFIED; arch_leave_lazy_mmu_mode(); if (mm != &init_mm) pte_unmap_unlock(mapped_pte, ptl); return err; } static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, unsigned long addr, unsigned long end, pte_fn_t fn, void *data, bool create, pgtbl_mod_mask *mask) { pmd_t *pmd; unsigned long next; int err = 0; BUG_ON(pud_leaf(*pud)); if (create) { pmd = pmd_alloc_track(mm, pud, addr, mask); if (!pmd) return -ENOMEM; } else { pmd = pmd_offset(pud, addr); } do { next = pmd_addr_end(addr, end); if (pmd_none(*pmd) && !create) continue; if (WARN_ON_ONCE(pmd_leaf(*pmd))) return -EINVAL; if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) { if (!create) continue; pmd_clear_bad(pmd); } err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create, mask); if (err) break; } while (pmd++, addr = next, addr != end); return err; } static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d, unsigned long addr, unsigned long end, pte_fn_t fn, void *data, bool create, pgtbl_mod_mask *mask) { pud_t *pud; unsigned long next; int err = 0; if (create) { pud = pud_alloc_track(mm, p4d, addr, mask); if (!pud) return -ENOMEM; } else { pud = pud_offset(p4d, addr); } do { next = pud_addr_end(addr, end); if (pud_none(*pud) && !create) continue; if (WARN_ON_ONCE(pud_leaf(*pud))) return -EINVAL; if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) { if (!create) continue; pud_clear_bad(pud); } err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create, mask); if (err) break; } while (pud++, addr = next, addr != end); return err; } static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd, unsigned long addr, unsigned long end, pte_fn_t fn, void *data, bool create, pgtbl_mod_mask *mask) { p4d_t *p4d; unsigned long next; int err = 0; if (create) { p4d = p4d_alloc_track(mm, pgd, addr, mask); if (!p4d) return -ENOMEM; } else { p4d = p4d_offset(pgd, addr); } do { next = p4d_addr_end(addr, end); if (p4d_none(*p4d) && !create) continue; if (WARN_ON_ONCE(p4d_leaf(*p4d))) return -EINVAL; if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) { if (!create) continue; p4d_clear_bad(p4d); } err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create, mask); if (err) break; } while (p4d++, addr = next, addr != end); return err; } static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr, unsigned long size, pte_fn_t fn, void *data, bool create) { pgd_t *pgd; unsigned long start = addr, next; unsigned long end = addr + size; pgtbl_mod_mask mask = 0; int err = 0; if (WARN_ON(addr >= end)) return -EINVAL; pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none(*pgd) && !create) continue; if (WARN_ON_ONCE(pgd_leaf(*pgd))) { err = -EINVAL; break; } if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) { if (!create) continue; pgd_clear_bad(pgd); } err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask); if (err) break; } while (pgd++, addr = next, addr != end); if (mask & ARCH_PAGE_TABLE_SYNC_MASK) arch_sync_kernel_mappings(start, start + size); return err; } /* * Scan a region of virtual memory, filling in page tables as necessary * and calling a provided function on each leaf page table. */ int apply_to_page_range(struct mm_struct *mm, unsigned long addr, unsigned long size, pte_fn_t fn, void *data) { return __apply_to_page_range(mm, addr, size, fn, data, true); } EXPORT_SYMBOL_GPL(apply_to_page_range); /* * Scan a region of virtual memory, calling a provided function on * each leaf page table where it exists. * * Unlike apply_to_page_range, this does _not_ fill in page tables * where they are absent. */ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr, unsigned long size, pte_fn_t fn, void *data) { return __apply_to_page_range(mm, addr, size, fn, data, false); } /* * handle_pte_fault chooses page fault handler according to an entry which was * read non-atomically. Before making any commitment, on those architectures * or configurations (e.g. i386 with PAE) which might give a mix of unmatched * parts, do_swap_page must check under lock before unmapping the pte and * proceeding (but do_wp_page is only called after already making such a check; * and do_anonymous_page can safely check later on). */ static inline int pte_unmap_same(struct vm_fault *vmf) { int same = 1; #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION) if (sizeof(pte_t) > sizeof(unsigned long)) { spin_lock(vmf->ptl); same = pte_same(ptep_get(vmf->pte), vmf->orig_pte); spin_unlock(vmf->ptl); } #endif pte_unmap(vmf->pte); vmf->pte = NULL; return same; } /* * Return: * 0: copied succeeded * -EHWPOISON: copy failed due to hwpoison in source page * -EAGAIN: copied failed (some other reason) */ static inline int __wp_page_copy_user(struct page *dst, struct page *src, struct vm_fault *vmf) { int ret; void *kaddr; void __user *uaddr; struct vm_area_struct *vma = vmf->vma; struct mm_struct *mm = vma->vm_mm; unsigned long addr = vmf->address; if (likely(src)) { if (copy_mc_user_highpage(dst, src, addr, vma)) return -EHWPOISON; return 0; } /* * If the source page was a PFN mapping, we don't have * a "struct page" for it. We do a best-effort copy by * just copying from the original user address. If that * fails, we just zero-fill it. Live with it. */ kaddr = kmap_local_page(dst); pagefault_disable(); uaddr = (void __user *)(addr & PAGE_MASK); /* * On architectures with software "accessed" bits, we would * take a double page fault, so mark it accessed here. */ vmf->pte = NULL; if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { pte_t entry; vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { /* * Other thread has already handled the fault * and update local tlb only */ if (vmf->pte) update_mmu_tlb(vma, addr, vmf->pte); ret = -EAGAIN; goto pte_unlock; } entry = pte_mkyoung(vmf->orig_pte); if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1); } /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with * zeroes. */ if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { if (vmf->pte) goto warn; /* Re-validate under PTL if the page is still mapped */ vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { /* The PTE changed under us, update local tlb */ if (vmf->pte) update_mmu_tlb(vma, addr, vmf->pte); ret = -EAGAIN; goto pte_unlock; } /* * The same page can be mapped back since last copy attempt. * Try to copy again under PTL. */ if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { /* * Give a warn in case there can be some obscure * use-case */ warn: WARN_ON_ONCE(1); clear_page(kaddr); } } ret = 0; pte_unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); pagefault_enable(); kunmap_local(kaddr); flush_dcache_page(dst); return ret; } static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma) { struct file *vm_file = vma->vm_file; if (vm_file) return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO; /* * Special mappings (e.g. VDSO) do not have any file so fake * a default GFP_KERNEL for them. */ return GFP_KERNEL; } /* * Notify the address space that the page is about to become writable so that * it can prohibit this or wait for the page to get into an appropriate state. * * We do this without the lock held, so that it can sleep if it needs to. */ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf, struct folio *folio) { vm_fault_t ret; unsigned int old_flags = vmf->flags; vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; if (vmf->vma->vm_file && IS_SWAPFILE(vmf->vma->vm_file->f_mapping->host)) return VM_FAULT_SIGBUS; ret = vmf->vma->vm_ops->page_mkwrite(vmf); /* Restore original flags so that caller is not surprised */ vmf->flags = old_flags; if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) return ret; if (unlikely(!(ret & VM_FAULT_LOCKED))) { folio_lock(folio); if (!folio->mapping) { folio_unlock(folio); return 0; /* retry */ } ret |= VM_FAULT_LOCKED; } else VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); return ret; } /* * Handle dirtying of a page in shared file mapping on a write fault. * * The function expects the page to be locked and unlocks it. */ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct address_space *mapping; struct folio *folio = page_folio(vmf->page); bool dirtied; bool page_mkwrite = vma->vm_ops && vma->vm_ops->page_mkwrite; dirtied = folio_mark_dirty(folio); VM_BUG_ON_FOLIO(folio_test_anon(folio), folio); /* * Take a local copy of the address_space - folio.mapping may be zeroed * by truncate after folio_unlock(). The address_space itself remains * pinned by vma->vm_file's reference. We rely on folio_unlock()'s * release semantics to prevent the compiler from undoing this copying. */ mapping = folio_raw_mapping(folio); folio_unlock(folio); if (!page_mkwrite) file_update_time(vma->vm_file); /* * Throttle page dirtying rate down to writeback speed. * * mapping may be NULL here because some device drivers do not * set page.mapping but still dirty their pages * * Drop the mmap_lock before waiting on IO, if we can. The file * is pinning the mapping, as per above. */ if ((dirtied || page_mkwrite) && mapping) { struct file *fpin; fpin = maybe_unlock_mmap_for_io(vmf, NULL); balance_dirty_pages_ratelimited(mapping); if (fpin) { fput(fpin); return VM_FAULT_COMPLETED; } } return 0; } /* * Handle write page faults for pages that can be reused in the current vma * * This can happen either due to the mapping being with the VM_SHARED flag, * or due to us being the last reference standing to the page. In either * case, all we need to do here is to mark the page as writable and update * any related book-keeping. */ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio) __releases(vmf->ptl) { struct vm_area_struct *vma = vmf->vma; pte_t entry; VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE)); VM_WARN_ON(is_zero_pfn(pte_pfn(vmf->orig_pte))); if (folio) { VM_BUG_ON(folio_test_anon(folio) && !PageAnonExclusive(vmf->page)); /* * Clear the folio's cpupid information as the existing * information potentially belongs to a now completely * unrelated process. */ folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1); } flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); entry = pte_mkyoung(vmf->orig_pte); entry = maybe_mkwrite(pte_mkdirty(entry), vma); if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1)) update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); pte_unmap_unlock(vmf->pte, vmf->ptl); count_vm_event(PGREUSE); } /* * We could add a bitflag somewhere, but for now, we know that all * vm_ops that have a ->map_pages have been audited and don't need * the mmap_lock to be held. */ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; if (vma->vm_ops->map_pages || !(vmf->flags & FAULT_FLAG_VMA_LOCK)) return 0; vma_end_read(vma); return VM_FAULT_RETRY; } /** * __vmf_anon_prepare - Prepare to handle an anonymous fault. * @vmf: The vm_fault descriptor passed from the fault handler. * * When preparing to insert an anonymous page into a VMA from a * fault handler, call this function rather than anon_vma_prepare(). * If this vma does not already have an associated anon_vma and we are * only protected by the per-VMA lock, the caller must retry with the * mmap_lock held. __anon_vma_prepare() will look at adjacent VMAs to * determine if this VMA can share its anon_vma, and that's not safe to * do with only the per-VMA lock held for this VMA. * * Return: 0 if fault handling can proceed. Any other value should be * returned to the caller. */ vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; vm_fault_t ret = 0; if (likely(vma->anon_vma)) return 0; if (vmf->flags & FAULT_FLAG_VMA_LOCK) { if (!mmap_read_trylock(vma->vm_mm)) return VM_FAULT_RETRY; } if (__anon_vma_prepare(vma)) ret = VM_FAULT_OOM; if (vmf->flags & FAULT_FLAG_VMA_LOCK) mmap_read_unlock(vma->vm_mm); return ret; } /* * Handle the case of a page which we actually need to copy to a new page, * either due to COW or unsharing. * * Called with mmap_lock locked and the old page referenced, but * without the ptl held. * * High level logic flow: * * - Allocate a page, copy the content of the old page to the new one. * - Handle book keeping and accounting - cgroups, mmu-notifiers, etc. * - Take the PTL. If the pte changed, bail out and release the allocated page * - If the pte is still the way we remember it, update the page table and all * relevant references. This includes dropping the reference the page-table * held to the old page, as well as updating the rmap. * - In any case, unlock the PTL and drop the reference we took to the old page. */ static vm_fault_t wp_page_copy(struct vm_fault *vmf) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; struct vm_area_struct *vma = vmf->vma; struct mm_struct *mm = vma->vm_mm; struct folio *old_folio = NULL; struct folio *new_folio = NULL; pte_t entry; int page_copied = 0; struct mmu_notifier_range range; vm_fault_t ret; bool pfn_is_zero; delayacct_wpcopy_start(); if (vmf->page) old_folio = page_folio(vmf->page); ret = vmf_anon_prepare(vmf); if (unlikely(ret)) goto out; pfn_is_zero = is_zero_pfn(pte_pfn(vmf->orig_pte)); new_folio = folio_prealloc(mm, vma, vmf->address, pfn_is_zero); if (!new_folio) goto oom; if (!pfn_is_zero) { int err; err = __wp_page_copy_user(&new_folio->page, vmf->page, vmf); if (err) { /* * COW failed, if the fault was solved by other, * it's fine. If not, userspace would re-fault on * the same address and we will handle the fault * from the second attempt. * The -EHWPOISON case will not be retried. */ folio_put(new_folio); if (old_folio) folio_put(old_folio); delayacct_wpcopy_end(); return err == -EHWPOISON ? VM_FAULT_HWPOISON : 0; } kmsan_copy_page_meta(&new_folio->page, vmf->page); } __folio_mark_uptodate(new_folio); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, vmf->address & PAGE_MASK, (vmf->address & PAGE_MASK) + PAGE_SIZE); mmu_notifier_invalidate_range_start(&range); /* * Re-check the pte - we dropped the lock */ vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl); if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { if (old_folio) { if (!folio_test_anon(old_folio)) { dec_mm_counter(mm, mm_counter_file(old_folio)); inc_mm_counter(mm, MM_ANONPAGES); } } else { ksm_might_unmap_zero_page(mm, vmf->orig_pte); inc_mm_counter(mm, MM_ANONPAGES); } flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); entry = folio_mk_pte(new_folio, vma->vm_page_prot); entry = pte_sw_mkyoung(entry); if (unlikely(unshare)) { if (pte_soft_dirty(vmf->orig_pte)) entry = pte_mksoft_dirty(entry); if (pte_uffd_wp(vmf->orig_pte)) entry = pte_mkuffd_wp(entry); } else { entry = maybe_mkwrite(pte_mkdirty(entry), vma); } /* * Clear the pte entry and flush it first, before updating the * pte with the new entry, to keep TLBs on different CPUs in * sync. This code used to set the new PTE then flush TLBs, but * that left a window where the new PTE could be loaded into * some TLBs while the old PTE remains in others. */ ptep_clear_flush(vma, vmf->address, vmf->pte); folio_add_new_anon_rmap(new_folio, vma, vmf->address, RMAP_EXCLUSIVE); folio_add_lru_vma(new_folio, vma); BUG_ON(unshare && pte_write(entry)); set_pte_at(mm, vmf->address, vmf->pte, entry); update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); if (old_folio) { /* * Only after switching the pte to the new page may * we remove the mapcount here. Otherwise another * process may come and find the rmap count decremented * before the pte is switched to the new page, and * "reuse" the old page writing into it while our pte * here still points into it and can be read by other * threads. * * The critical issue is to order this * folio_remove_rmap_pte() with the ptp_clear_flush * above. Those stores are ordered by (if nothing else,) * the barrier present in the atomic_add_negative * in folio_remove_rmap_pte(); * * Then the TLB flush in ptep_clear_flush ensures that * no process can access the old page before the * decremented mapcount is visible. And the old page * cannot be reused until after the decremented * mapcount is visible. So transitively, TLBs to * old page will be flushed before it can be reused. */ folio_remove_rmap_pte(old_folio, vmf->page, vma); } /* Free the old page.. */ new_folio = old_folio; page_copied = 1; pte_unmap_unlock(vmf->pte, vmf->ptl); } else if (vmf->pte) { update_mmu_tlb(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); } mmu_notifier_invalidate_range_end(&range); if (new_folio) folio_put(new_folio); if (old_folio) { if (page_copied) free_swap_cache(old_folio); folio_put(old_folio); } delayacct_wpcopy_end(); return 0; oom: ret = VM_FAULT_OOM; out: if (old_folio) folio_put(old_folio); delayacct_wpcopy_end(); return ret; } /** * finish_mkwrite_fault - finish page fault for a shared mapping, making PTE * writeable once the page is prepared * * @vmf: structure describing the fault * @folio: the folio of vmf->page * * This function handles all that is needed to finish a write page fault in a * shared mapping due to PTE being read-only once the mapped page is prepared. * It handles locking of PTE and modifying it. * * The function expects the page to be locked or other protection against * concurrent faults / writeback (such as DAX radix tree locks). * * Return: %0 on success, %VM_FAULT_NOPAGE when PTE got changed before * we acquired PTE lock. */ static vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf, struct folio *folio) { WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED)); vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!vmf->pte) return VM_FAULT_NOPAGE; /* * We might have raced with another page fault while we released the * pte_offset_map_lock. */ if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte)) { update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); return VM_FAULT_NOPAGE; } wp_page_reuse(vmf, folio); return 0; } /* * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED * mapping */ static vm_fault_t wp_pfn_shared(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) { vm_fault_t ret; pte_unmap_unlock(vmf->pte, vmf->ptl); ret = vmf_can_call_fault(vmf); if (ret) return ret; vmf->flags |= FAULT_FLAG_MKWRITE; ret = vma->vm_ops->pfn_mkwrite(vmf); if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)) return ret; return finish_mkwrite_fault(vmf, NULL); } wp_page_reuse(vmf, NULL); return 0; } static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio) __releases(vmf->ptl) { struct vm_area_struct *vma = vmf->vma; vm_fault_t ret = 0; folio_get(folio); if (vma->vm_ops && vma->vm_ops->page_mkwrite) { vm_fault_t tmp; pte_unmap_unlock(vmf->pte, vmf->ptl); tmp = vmf_can_call_fault(vmf); if (tmp) { folio_put(folio); return tmp; } tmp = do_page_mkwrite(vmf, folio); if (unlikely(!tmp || (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) { folio_put(folio); return tmp; } tmp = finish_mkwrite_fault(vmf, folio); if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) { folio_unlock(folio); folio_put(folio); return tmp; } } else { wp_page_reuse(vmf, folio); folio_lock(folio); } ret |= fault_dirty_shared_page(vmf); folio_put(folio); return ret; } #ifdef CONFIG_TRANSPARENT_HUGEPAGE static bool __wp_can_reuse_large_anon_folio(struct folio *folio, struct vm_area_struct *vma) { bool exclusive = false; /* Let's just free up a large folio if only a single page is mapped. */ if (folio_large_mapcount(folio) <= 1) return false; /* * The assumption for anonymous folios is that each page can only get * mapped once into each MM. The only exception are KSM folios, which * are always small. * * Each taken mapcount must be paired with exactly one taken reference, * whereby the refcount must be incremented before the mapcount when * mapping a page, and the refcount must be decremented after the * mapcount when unmapping a page. * * If all folio references are from mappings, and all mappings are in * the page tables of this MM, then this folio is exclusive to this MM. */ if (test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids)) return false; VM_WARN_ON_ONCE(folio_test_ksm(folio)); if (unlikely(folio_test_swapcache(folio))) { /* * Note: freeing up the swapcache will fail if some PTEs are * still swap entries. */ if (!folio_trylock(folio)) return false; folio_free_swap(folio); folio_unlock(folio); } if (folio_large_mapcount(folio) != folio_ref_count(folio)) return false; /* Stabilize the mapcount vs. refcount and recheck. */ folio_lock_large_mapcount(folio); VM_WARN_ON_ONCE_FOLIO(folio_large_mapcount(folio) > folio_ref_count(folio), folio); if (test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids)) goto unlock; if (folio_large_mapcount(folio) != folio_ref_count(folio)) goto unlock; VM_WARN_ON_ONCE_FOLIO(folio_large_mapcount(folio) > folio_nr_pages(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_entire_mapcount(folio), folio); VM_WARN_ON_ONCE(folio_mm_id(folio, 0) != vma->vm_mm->mm_id && folio_mm_id(folio, 1) != vma->vm_mm->mm_id); /* * Do we need the folio lock? Likely not. If there would have been * references from page migration/swapout, we would have detected * an additional folio reference and never ended up here. */ exclusive = true; unlock: folio_unlock_large_mapcount(folio); return exclusive; } #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ static bool __wp_can_reuse_large_anon_folio(struct folio *folio, struct vm_area_struct *vma) { BUILD_BUG(); } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static bool wp_can_reuse_anon_folio(struct folio *folio, struct vm_area_struct *vma) { if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && folio_test_large(folio)) return __wp_can_reuse_large_anon_folio(folio, vma); /* * We have to verify under folio lock: these early checks are * just an optimization to avoid locking the folio and freeing * the swapcache if there is little hope that we can reuse. * * KSM doesn't necessarily raise the folio refcount. */ if (folio_test_ksm(folio) || folio_ref_count(folio) > 3) return false; if (!folio_test_lru(folio)) /* * We cannot easily detect+handle references from * remote LRU caches or references to LRU folios. */ lru_add_drain(); if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio)) return false; if (!folio_trylock(folio)) return false; if (folio_test_swapcache(folio)) folio_free_swap(folio); if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) { folio_unlock(folio); return false; } /* * Ok, we've got the only folio reference from our mapping * and the folio is locked, it's dark out, and we're wearing * sunglasses. Hit it. */ folio_move_anon_rmap(folio, vma); folio_unlock(folio); return true; } /* * This routine handles present pages, when * * users try to write to a shared page (FAULT_FLAG_WRITE) * * GUP wants to take a R/O pin on a possibly shared anonymous page * (FAULT_FLAG_UNSHARE) * * It is done by copying the page to a new address and decrementing the * shared-page counter for the old page. * * Note that this routine assumes that the protection checks have been * done by the caller (the low-level page fault routine in most cases). * Thus, with FAULT_FLAG_WRITE, we can safely just mark it writable once we've * done any necessary COW. * * In case of FAULT_FLAG_WRITE, we also mark the page dirty at this point even * though the page will change only once the write actually happens. This * avoids a few races, and potentially makes it more efficient. * * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), with pte both mapped and locked. * We return with mmap_lock still held, but pte unmapped and unlocked. */ static vm_fault_t do_wp_page(struct vm_fault *vmf) __releases(vmf->ptl) { const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; struct vm_area_struct *vma = vmf->vma; struct folio *folio = NULL; pte_t pte; if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, ptep_get(vmf->pte))) { if (!userfaultfd_wp_async(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); return handle_userfault(vmf, VM_UFFD_WP); } /* * Nothing needed (cache flush, TLB invalidations, * etc.) because we're only removing the uffd-wp bit, * which is completely invisible to the user. */ pte = pte_clear_uffd_wp(ptep_get(vmf->pte)); set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); /* * Update this to be prepared for following up CoW * handling */ vmf->orig_pte = pte; } /* * Userfaultfd write-protect can defer flushes. Ensure the TLB * is flushed in this case before copying. */ if (unlikely(userfaultfd_wp(vmf->vma) && mm_tlb_flush_pending(vmf->vma->vm_mm))) flush_tlb_page(vmf->vma, vmf->address); } vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); if (vmf->page) folio = page_folio(vmf->page); /* * Shared mapping: we are guaranteed to have VM_WRITE and * FAULT_FLAG_WRITE set at this point. */ if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { /* * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a * VM_PFNMAP VMA. FS DAX also wants ops->pfn_mkwrite called. * * We should not cow pages in a shared writeable mapping. * Just mark the pages writable and/or call ops->pfn_mkwrite. */ if (!vmf->page || is_fsdax_page(vmf->page)) { vmf->page = NULL; return wp_pfn_shared(vmf); } return wp_page_shared(vmf, folio); } /* * Private mapping: create an exclusive anonymous page copy if reuse * is impossible. We might miss VM_WRITE for FOLL_FORCE handling. * * If we encounter a page that is marked exclusive, we must reuse * the page without further checks. */ if (folio && folio_test_anon(folio) && (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) { if (!PageAnonExclusive(vmf->page)) SetPageAnonExclusive(vmf->page); if (unlikely(unshare)) { pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } wp_page_reuse(vmf, folio); return 0; } /* * Ok, we need to copy. Oh, well.. */ if (folio) folio_get(folio); pte_unmap_unlock(vmf->pte, vmf->ptl); #ifdef CONFIG_KSM if (folio && folio_test_ksm(folio)) count_vm_event(COW_KSM); #endif return wp_page_copy(vmf); } static void unmap_mapping_range_vma(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, struct zap_details *details) { zap_page_range_single(vma, start_addr, end_addr - start_addr, details); } static inline void unmap_mapping_range_tree(struct rb_root_cached *root, pgoff_t first_index, pgoff_t last_index, struct zap_details *details) { struct vm_area_struct *vma; pgoff_t vba, vea, zba, zea; vma_interval_tree_foreach(vma, root, first_index, last_index) { vba = vma->vm_pgoff; vea = vba + vma_pages(vma) - 1; zba = max(first_index, vba); zea = min(last_index, vea); unmap_mapping_range_vma(vma, ((zba - vba) << PAGE_SHIFT) + vma->vm_start, ((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start, details); } } /** * unmap_mapping_folio() - Unmap single folio from processes. * @folio: The locked folio to be unmapped. * * Unmap this folio from any userspace process which still has it mmaped. * Typically, for efficiency, the range of nearby pages has already been * unmapped by unmap_mapping_pages() or unmap_mapping_range(). But once * truncation or invalidation holds the lock on a folio, it may find that * the page has been remapped again: and then uses unmap_mapping_folio() * to unmap it finally. */ void unmap_mapping_folio(struct folio *folio) { struct address_space *mapping = folio->mapping; struct zap_details details = { }; pgoff_t first_index; pgoff_t last_index; VM_BUG_ON(!folio_test_locked(folio)); first_index = folio->index; last_index = folio_next_index(folio) - 1; details.even_cows = false; details.single_folio = folio; details.zap_flags = ZAP_FLAG_DROP_MARKER; i_mmap_lock_read(mapping); if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))) unmap_mapping_range_tree(&mapping->i_mmap, first_index, last_index, &details); i_mmap_unlock_read(mapping); } /** * unmap_mapping_pages() - Unmap pages from processes. * @mapping: The address space containing pages to be unmapped. * @start: Index of first page to be unmapped. * @nr: Number of pages to be unmapped. 0 to unmap to end of file. * @even_cows: Whether to unmap even private COWed pages. * * Unmap the pages in this address space from any userspace process which * has them mmaped. Generally, you want to remove COWed pages as well when * a file is being truncated, but not when invalidating pages from the page * cache. */ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows) { struct zap_details details = { }; pgoff_t first_index = start; pgoff_t last_index = start + nr - 1; details.even_cows = even_cows; if (last_index < first_index) last_index = ULONG_MAX; i_mmap_lock_read(mapping); if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))) unmap_mapping_range_tree(&mapping->i_mmap, first_index, last_index, &details); i_mmap_unlock_read(mapping); } EXPORT_SYMBOL_GPL(unmap_mapping_pages); /** * unmap_mapping_range - unmap the portion of all mmaps in the specified * address_space corresponding to the specified byte range in the underlying * file. * * @mapping: the address space containing mmaps to be unmapped. * @holebegin: byte in first page to unmap, relative to the start of * the underlying file. This will be rounded down to a PAGE_SIZE * boundary. Note that this is different from truncate_pagecache(), which * must keep the partial page. In contrast, we must get rid of * partial pages. * @holelen: size of prospective hole in bytes. This will be rounded * up to a PAGE_SIZE boundary. A holelen of zero truncates to the * end of the file. * @even_cows: 1 when truncating a file, unmap even private COWed pages; * but 0 when invalidating pagecache, don't throw away private data. */ void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows) { pgoff_t hba = (pgoff_t)(holebegin) >> PAGE_SHIFT; pgoff_t hlen = ((pgoff_t)(holelen) + PAGE_SIZE - 1) >> PAGE_SHIFT; /* Check for overflow. */ if (sizeof(holelen) > sizeof(hlen)) { long long holeend = (holebegin + holelen + PAGE_SIZE - 1) >> PAGE_SHIFT; if (holeend & ~(long long)ULONG_MAX) hlen = ULONG_MAX - hba + 1; } unmap_mapping_pages(mapping, hba, hlen, even_cows); } EXPORT_SYMBOL(unmap_mapping_range); /* * Restore a potential device exclusive pte to a working pte entry */ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) { struct folio *folio = page_folio(vmf->page); struct vm_area_struct *vma = vmf->vma; struct mmu_notifier_range range; vm_fault_t ret; /* * We need a reference to lock the folio because we don't hold * the PTL so a racing thread can remove the device-exclusive * entry and unmap it. If the folio is free the entry must * have been removed already. If it happens to have already * been re-allocated after being freed all we do is lock and * unlock it. */ if (!folio_try_get(folio)) return 0; ret = folio_lock_or_retry(folio, vmf); if (ret) { folio_put(folio); return ret; } mmu_notifier_range_init_owner(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, vmf->address & PAGE_MASK, (vmf->address & PAGE_MASK) + PAGE_SIZE, NULL); mmu_notifier_invalidate_range_start(&range); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) restore_exclusive_pte(vma, folio, vmf->page, vmf->address, vmf->pte, vmf->orig_pte); if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); folio_unlock(folio); folio_put(folio); mmu_notifier_invalidate_range_end(&range); return 0; } static inline bool should_try_to_free_swap(struct folio *folio, struct vm_area_struct *vma, unsigned int fault_flags) { if (!folio_test_swapcache(folio)) return false; if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) || folio_test_mlocked(folio)) return true; /* * If we want to map a page that's in the swapcache writable, we * have to detect via the refcount if we're really the exclusive * user. Try freeing the swapcache to get rid of the swapcache * reference only in case it's likely that we'll be the exlusive user. */ return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) && folio_ref_count(folio) == (1 + folio_nr_pages(folio)); } static vm_fault_t pte_marker_clear(struct vm_fault *vmf) { vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!vmf->pte) return 0; /* * Be careful so that we will only recover a special uffd-wp pte into a * none pte. Otherwise it means the pte could have changed, so retry. * * This should also cover the case where e.g. the pte changed * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_POISONED. * So is_pte_marker() check is not enough to safely drop the pte. */ if (pte_same(vmf->orig_pte, ptep_get(vmf->pte))) pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } static vm_fault_t do_pte_missing(struct vm_fault *vmf) { if (vma_is_anonymous(vmf->vma)) return do_anonymous_page(vmf); else return do_fault(vmf); } /* * This is actually a page-missing access, but with uffd-wp special pte * installed. It means this pte was wr-protected before being unmapped. */ static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf) { /* * Just in case there're leftover special ptes even after the region * got unregistered - we can simply clear them. */ if (unlikely(!userfaultfd_wp(vmf->vma))) return pte_marker_clear(vmf); return do_pte_missing(vmf); } static vm_fault_t handle_pte_marker(struct vm_fault *vmf) { swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte); unsigned long marker = pte_marker_get(entry); /* * PTE markers should never be empty. If anything weird happened, * the best thing to do is to kill the process along with its mm. */ if (WARN_ON_ONCE(!marker)) return VM_FAULT_SIGBUS; /* Higher priority than uffd-wp when data corrupted */ if (marker & PTE_MARKER_POISONED) return VM_FAULT_HWPOISON; /* Hitting a guard page is always a fatal condition. */ if (marker & PTE_MARKER_GUARD) return VM_FAULT_SIGSEGV; if (pte_marker_entry_uffd_wp(entry)) return pte_marker_handle_uffd_wp(vmf); /* This is an unknown pte marker */ return VM_FAULT_SIGBUS; } static struct folio *__alloc_swap_folio(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct folio *folio; swp_entry_t entry; folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address); if (!folio) return NULL; entry = pte_to_swp_entry(vmf->orig_pte); if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) { folio_put(folio); return NULL; } return folio; } #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* * Check if the PTEs within a range are contiguous swap entries * and have consistent swapcache, zeromap. */ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) { unsigned long addr; swp_entry_t entry; int idx; pte_t pte; addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); idx = (vmf->address - addr) / PAGE_SIZE; pte = ptep_get(ptep); if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) return false; entry = pte_to_swp_entry(pte); if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) return false; /* * swap_read_folio() can't handle the case a large folio is hybridly * from different backends. And they are likely corner cases. Similar * things might be added once zswap support large folios. */ if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages)) return false; if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages)) return false; return true; } static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, unsigned long addr, unsigned long orders) { int order, nr; order = highest_order(orders); /* * To swap in a THP with nr pages, we require that its first swap_offset * is aligned with that number, as it was when the THP was swapped out. * This helps filter out most invalid entries. */ while (orders) { nr = 1 << order; if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) break; order = next_order(&orders, order); } return orders; } static struct folio *alloc_swap_folio(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; unsigned long orders; struct folio *folio; unsigned long addr; swp_entry_t entry; spinlock_t *ptl; pte_t *pte; gfp_t gfp; int order; /* * If uffd is active for the vma we need per-page fault fidelity to * maintain the uffd semantics. */ if (unlikely(userfaultfd_armed(vma))) goto fallback; /* * A large swapped out folio could be partially or fully in zswap. We * lack handling for such cases, so fallback to swapping in order-0 * folio. */ if (!zswap_never_enabled()) goto fallback; entry = pte_to_swp_entry(vmf->orig_pte); /* * Get a list of all the (large) orders below PMD_ORDER that are enabled * and suitable for swapping THP. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); if (!orders) goto fallback; pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); if (unlikely(!pte)) goto fallback; /* * For do_swap_page, find the highest order where the aligned range is * completely swap entries with contiguous swap offsets. */ order = highest_order(orders); while (orders) { addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) break; order = next_order(&orders, order); } pte_unmap_unlock(pte, ptl); /* Try allocating the highest of the remaining orders. */ gfp = vma_thp_gfp_mask(vma); while (orders) { addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); folio = vma_alloc_folio(gfp, order, vma, addr); if (folio) { if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry)) return folio; count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); folio_put(folio); } count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); order = next_order(&orders, order); } fallback: return __alloc_swap_folio(vmf); } #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ static struct folio *alloc_swap_folio(struct vm_fault *vmf) { return __alloc_swap_folio(vmf); } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. * We return with pte unmapped and unlocked. * * We return with the mmap_lock locked or unlocked in the same cases * as does filemap_fault(). */ vm_fault_t do_swap_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct folio *swapcache, *folio = NULL; DECLARE_WAITQUEUE(wait, current); struct page *page; struct swap_info_struct *si = NULL; rmap_t rmap_flags = RMAP_NONE; bool need_clear_cache = false; bool exclusive = false; swp_entry_t entry; pte_t pte; vm_fault_t ret = 0; void *shadow = NULL; int nr_pages; unsigned long page_idx; unsigned long address; pte_t *ptep; if (!pte_unmap_same(vmf)) goto out; entry = pte_to_swp_entry(vmf->orig_pte); if (unlikely(non_swap_entry(entry))) { if (is_migration_entry(entry)) { migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address); } else if (is_device_exclusive_entry(entry)) { vmf->page = pfn_swap_entry_to_page(entry); ret = remove_device_exclusive_entry(vmf); } else if (is_device_private_entry(entry)) { if (vmf->flags & FAULT_FLAG_VMA_LOCK) { /* * migrate_to_ram is not yet ready to operate * under VMA lock. */ vma_end_read(vma); ret = VM_FAULT_RETRY; goto out; } vmf->page = pfn_swap_entry_to_page(entry); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) goto unlock; /* * Get a page reference while we know the page can't be * freed. */ if (trylock_page(vmf->page)) { struct dev_pagemap *pgmap; get_page(vmf->page); pte_unmap_unlock(vmf->pte, vmf->ptl); pgmap = page_pgmap(vmf->page); ret = pgmap->ops->migrate_to_ram(vmf); unlock_page(vmf->page); put_page(vmf->page); } else { pte_unmap_unlock(vmf->pte, vmf->ptl); } } else if (is_hwpoison_entry(entry)) { ret = VM_FAULT_HWPOISON; } else if (is_pte_marker_entry(entry)) { ret = handle_pte_marker(vmf); } else { print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL); ret = VM_FAULT_SIGBUS; } goto out; } /* Prevent swapoff from happening to us. */ si = get_swap_device(entry); if (unlikely(!si)) goto out; folio = swap_cache_get_folio(entry); if (folio) swap_update_readahead(folio, vma, vmf->address); swapcache = folio; if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { /* skip swapcache */ folio = alloc_swap_folio(vmf); if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); nr_pages = folio_nr_pages(folio); if (folio_test_large(folio)) entry.val = ALIGN_DOWN(entry.val, nr_pages); /* * Prevent parallel swapin from proceeding with * the cache flag. Otherwise, another thread * may finish swapin first, free the entry, and * swapout reusing the same entry. It's * undetectable as pte_same() returns true due * to entry reuse. */ if (swapcache_prepare(entry, nr_pages)) { /* * Relax a bit to prevent rapid * repeated page faults. */ add_wait_queue(&swapcache_wq, &wait); schedule_timeout_uninterruptible(1); remove_wait_queue(&swapcache_wq, &wait); goto out_page; } need_clear_cache = true; memcg1_swapin(entry, nr_pages); shadow = swap_cache_get_shadow(entry); if (shadow) workingset_refault(folio, shadow); folio_add_lru(folio); /* To provide entry to swap_read_folio() */ folio->swap = entry; swap_read_folio(folio, NULL); folio->private = NULL; } } else { folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); swapcache = folio; } if (!folio) { /* * Back out if somebody else faulted in this pte * while we released the pte lock. */ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) ret = VM_FAULT_OOM; goto unlock; } /* Had to read the page from swap area: Major fault */ ret = VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); } ret |= folio_lock_or_retry(folio, vmf); if (ret & VM_FAULT_RETRY) goto out_release; page = folio_file_page(folio, swp_offset(entry)); if (swapcache) { /* * Make sure folio_free_swap() or swapoff did not release the * swapcache from under us. The page pin, and pte_same test * below, are not enough to exclude that. Even if it is still * swapcache, we need to check that the page's swap has not * changed. */ if (unlikely(!folio_matches_swap_entry(folio, entry))) goto out_page; if (unlikely(PageHWPoison(page))) { /* * hwpoisoned dirty swapcache pages are kept for killing * owner processes (which may be unknown at hwpoison time) */ ret = VM_FAULT_HWPOISON; goto out_page; } /* * KSM sometimes has to copy on read faults, for example, if * folio->index of non-ksm folios would be nonlinear inside the * anon VMA -- the ksm flag is lost on actual swapout. */ folio = ksm_might_need_to_copy(folio, vma, vmf->address); if (unlikely(!folio)) { ret = VM_FAULT_OOM; folio = swapcache; goto out_page; } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) { ret = VM_FAULT_HWPOISON; folio = swapcache; goto out_page; } if (folio != swapcache) page = folio_page(folio, 0); /* * If we want to map a page that's in the swapcache writable, we * have to detect via the refcount if we're really the exclusive * owner. Try removing the extra reference from the local LRU * caches if required. */ if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache && !folio_test_ksm(folio) && !folio_test_lru(folio)) lru_add_drain(); } folio_throttle_swaprate(folio, GFP_KERNEL); /* * Back out if somebody else already faulted in this pte. */ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) goto out_nomap; if (unlikely(!folio_test_uptodate(folio))) { ret = VM_FAULT_SIGBUS; goto out_nomap; } /* allocated large folios for SWP_SYNCHRONOUS_IO */ if (folio_test_large(folio) && !folio_test_swapcache(folio)) { unsigned long nr = folio_nr_pages(folio); unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; pte_t *folio_ptep = vmf->pte - idx; pte_t folio_pte = ptep_get(folio_ptep); if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || swap_pte_batch(folio_ptep, nr, folio_pte) != nr) goto out_nomap; page_idx = idx; address = folio_start; ptep = folio_ptep; goto check_folio; } nr_pages = 1; page_idx = 0; address = vmf->address; ptep = vmf->pte; if (folio_test_large(folio) && folio_test_swapcache(folio)) { int nr = folio_nr_pages(folio); unsigned long idx = folio_page_idx(folio, page); unsigned long folio_start = address - idx * PAGE_SIZE; unsigned long folio_end = folio_start + nr * PAGE_SIZE; pte_t *folio_ptep; pte_t folio_pte; if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start))) goto check_folio; if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end))) goto check_folio; folio_ptep = vmf->pte - idx; folio_pte = ptep_get(folio_ptep); if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || swap_pte_batch(folio_ptep, nr, folio_pte) != nr) goto check_folio; page_idx = idx; address = folio_start; ptep = folio_ptep; nr_pages = nr; entry = folio->swap; page = &folio->page; } check_folio: /* * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte * must never point at an anonymous page in the swapcache that is * PG_anon_exclusive. Sanity check that this holds and especially, that * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity * check after taking the PT lock and making sure that nobody * concurrently faulted in this page and set PG_anon_exclusive. */ BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio)); BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page)); /* * Check under PT lock (to protect against concurrent fork() sharing * the swap entry concurrently) for certainly exclusive pages. */ if (!folio_test_ksm(folio)) { exclusive = pte_swp_exclusive(vmf->orig_pte); if (folio != swapcache) { /* * We have a fresh page that is not exposed to the * swapcache -> certainly exclusive. */ exclusive = true; } else if (exclusive && folio_test_writeback(folio) && data_race(si->flags & SWP_STABLE_WRITES)) { /* * This is tricky: not all swap backends support * concurrent page modifications while under writeback. * * So if we stumble over such a page in the swapcache * we must not set the page exclusive, otherwise we can * map it writable without further checks and modify it * while still under writeback. * * For these problematic swap backends, simply drop the * exclusive marker: this is perfectly fine as we start * writeback only if we fully unmapped the page and * there are no unexpected references on the page after * unmapping succeeded. After fully unmapped, no * further GUP references (FOLL_GET and FOLL_PIN) can * appear, so dropping the exclusive marker and mapping * it only R/O is fine. */ exclusive = false; } } /* * Some architectures may have to restore extra metadata to the page * when reading from swap. This metadata may be indexed by swap entry * so this must be called before swap_free(). */ arch_swap_restore(folio_swap(entry, folio), folio); /* * Remove the swap entry and conditionally try to free up the swapcache. * We're already holding a reference on the page but haven't mapped it * yet. */ swap_free_nr(entry, nr_pages); if (should_try_to_free_swap(folio, vma, vmf->flags)) folio_free_swap(folio); add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); pte = mk_pte(page, vma->vm_page_prot); if (pte_swp_soft_dirty(vmf->orig_pte)) pte = pte_mksoft_dirty(pte); if (pte_swp_uffd_wp(vmf->orig_pte)) pte = pte_mkuffd_wp(pte); /* * Same logic as in do_wp_page(); however, optimize for pages that are * certainly not shared either because we just allocated them without * exposing them to the swapcache or because the swap entry indicates * exclusivity. */ if (!folio_test_ksm(folio) && (exclusive || folio_ref_count(folio) == 1)) { if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && !pte_needs_soft_dirty_wp(vma, pte)) { pte = pte_mkwrite(pte, vma); if (vmf->flags & FAULT_FLAG_WRITE) { pte = pte_mkdirty(pte); vmf->flags &= ~FAULT_FLAG_WRITE; } } rmap_flags |= RMAP_EXCLUSIVE; } folio_ref_add(folio, nr_pages - 1); flush_icache_pages(vma, page, nr_pages); vmf->orig_pte = pte_advance_pfn(pte, page_idx); /* ksm created a completely new copy */ if (unlikely(folio != swapcache && swapcache)) { folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* * We currently only expect small !anon folios which are either * fully exclusive or fully shared, or new allocated large * folios which are fully exclusive. If we ever get large * folios within swapcache here, we have to be careful. */ VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, rmap_flags); } VM_BUG_ON(!folio_test_anon(folio) || (pte_write(pte) && !PageAnonExclusive(page))); set_ptes(vma->vm_mm, address, ptep, pte, nr_pages); arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages); folio_unlock(folio); if (folio != swapcache && swapcache) { /* * Hold the lock to avoid the swap entry to be reused * until we take the PT lock for the pte_same() check * (to avoid false positives from pte_same). For * further safety release the lock after the swap_free * so that the swap count won't change under a * parallel locked swapcache. */ folio_unlock(swapcache); folio_put(swapcache); } if (vmf->flags & FAULT_FLAG_WRITE) { ret |= do_wp_page(vmf); if (ret & VM_FAULT_ERROR) ret &= VM_FAULT_ERROR; goto out; } /* No need to invalidate - it was non-present before */ update_mmu_cache_range(vmf, vma, address, ptep, nr_pages); unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) { swapcache_clear(si, entry, nr_pages); if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } if (si) put_swap_device(si); return ret; out_nomap: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); out_page: folio_unlock(folio); out_release: folio_put(folio); if (folio != swapcache && swapcache) { folio_unlock(swapcache); folio_put(swapcache); } if (need_clear_cache) { swapcache_clear(si, entry, nr_pages); if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } if (si) put_swap_device(si); return ret; } static bool pte_range_none(pte_t *pte, int nr_pages) { int i; for (i = 0; i < nr_pages; i++) { if (!pte_none(ptep_get_lockless(pte + i))) return false; } return true; } static struct folio *alloc_anon_folio(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; #ifdef CONFIG_TRANSPARENT_HUGEPAGE unsigned long orders; struct folio *folio; unsigned long addr; pte_t *pte; gfp_t gfp; int order; /* * If uffd is active for the vma we need per-page fault fidelity to * maintain the uffd semantics. */ if (unlikely(userfaultfd_armed(vma))) goto fallback; /* * Get a list of all the (large) orders below PMD_ORDER that are enabled * for this vma. Then filter out the orders that can't be allocated over * the faulting address and still be fully contained in the vma. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); if (!orders) goto fallback; pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); if (!pte) return ERR_PTR(-EAGAIN); /* * Find the highest order where the aligned range is completely * pte_none(). Note that all remaining orders will be completely * pte_none(). */ order = highest_order(orders); while (orders) { addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); if (pte_range_none(pte + pte_index(addr), 1 << order)) break; order = next_order(&orders, order); } pte_unmap(pte); if (!orders) goto fallback; /* Try allocating the highest of the remaining orders. */ gfp = vma_thp_gfp_mask(vma); while (orders) { addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); folio = vma_alloc_folio(gfp, order, vma, addr); if (folio) { if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) { count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); folio_put(folio); goto next; } folio_throttle_swaprate(folio, gfp); /* * When a folio is not zeroed during allocation * (__GFP_ZERO not used) or user folios require special * handling, folio_zero_user() is used to make sure * that the page corresponding to the faulting address * will be hot in the cache after zeroing. */ if (user_alloc_needs_zeroing()) folio_zero_user(folio, vmf->address); return folio; } next: count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK); order = next_order(&orders, order); } fallback: #endif return folio_prealloc(vma->vm_mm, vma, vmf->address, true); } /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. * We return with mmap_lock still held, but pte unmapped and unlocked. */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; unsigned long addr = vmf->address; struct folio *folio; vm_fault_t ret = 0; int nr_pages = 1; pte_t entry; /* File mapping without ->vm_ops ? */ if (vma->vm_flags & VM_SHARED) return VM_FAULT_SIGBUS; /* * Use pte_alloc() instead of pte_alloc_map(), so that OOM can * be distinguished from a transient failure of pte_offset_map(). */ if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; /* Use the zero-page for reads */ if (!(vmf->flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(vma->vm_mm)) { entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), vma->vm_page_prot)); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!vmf->pte) goto unlock; if (vmf_pte_changed(vmf)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto unlock; } ret = check_stable_address_space(vma->vm_mm); if (ret) goto unlock; /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); return handle_userfault(vmf, VM_UFFD_MISSING); } goto setpte; } /* Allocate our own private page. */ ret = vmf_anon_prepare(vmf); if (ret) return ret; /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */ folio = alloc_anon_folio(vmf); if (IS_ERR(folio)) return 0; if (!folio) goto oom; nr_pages = folio_nr_pages(folio); addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); /* * The memory barrier inside __folio_mark_uptodate makes sure that * preceding stores to the page contents become visible before * the set_pte_at() write. */ __folio_mark_uptodate(folio); entry = folio_mk_pte(folio, vma->vm_page_prot); entry = pte_sw_mkyoung(entry); if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry), vma); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) goto release; if (nr_pages == 1 && vmf_pte_changed(vmf)) { update_mmu_tlb(vma, addr, vmf->pte); goto release; } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) { update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages); goto release; } ret = check_stable_address_space(vma->vm_mm); if (ret) goto release; /* Deliver the page fault to userland, check inside PT lock */ if (userfaultfd_missing(vma)) { pte_unmap_unlock(vmf->pte, vmf->ptl); folio_put(folio); return handle_userfault(vmf, VM_UFFD_MISSING); } folio_ref_add(folio, nr_pages - 1); add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC); folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); setpte: if (vmf_orig_pte_uffd_wp(vmf)) entry = pte_mkuffd_wp(entry); set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages); /* No need to invalidate - it was non-present before */ update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages); unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; release: folio_put(folio); goto unlock; oom: return VM_FAULT_OOM; } /* * The mmap_lock must have been held on entry, and may have been * released depending on flags and vma->vm_ops->fault() return value. * See filemap_fault() and __lock_page_retry(). */ static vm_fault_t __do_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct folio *folio; vm_fault_t ret; /* * Preallocate pte before we take page_lock because this might lead to * deadlocks for memcg reclaim which waits for pages under writeback: * lock_page(A) * SetPageWriteback(A) * unlock_page(A) * lock_page(B) * lock_page(B) * pte_alloc_one * shrink_folio_list * wait_on_page_writeback(A) * SetPageWriteback(B) * unlock_page(B) * # flush A, B to clear the writeback */ if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) { vmf->prealloc_pte = pte_alloc_one(vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; } ret = vma->vm_ops->fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY | VM_FAULT_DONE_COW))) return ret; folio = page_folio(vmf->page); if (unlikely(PageHWPoison(vmf->page))) { vm_fault_t poisonret = VM_FAULT_HWPOISON; if (ret & VM_FAULT_LOCKED) { if (page_mapped(vmf->page)) unmap_mapping_folio(folio); /* Retry if a clean folio was removed from the cache. */ if (mapping_evict_folio(folio->mapping, folio)) poisonret = VM_FAULT_NOPAGE; folio_unlock(folio); } folio_put(folio); vmf->page = NULL; return poisonret; } if (unlikely(!(ret & VM_FAULT_LOCKED))) folio_lock(folio); else VM_BUG_ON_PAGE(!folio_test_locked(folio), vmf->page); return ret; } #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); /* * We are going to consume the prealloc table, * count that as nr_ptes. */ mm_inc_nr_ptes(vma->vm_mm); vmf->prealloc_pte = NULL; } vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; pmd_t entry; vm_fault_t ret = VM_FAULT_FALLBACK; /* * It is too late to allocate a small folio, we already have a large * folio in the pagecache: especially s390 KVM cannot tolerate any * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any * PMD mappings if THPs are disabled. As we already have a THP, * behave as if we are forcing a collapse. */ if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags, /* forced_collapse=*/ true)) return ret; if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER)) return ret; if (folio_order(folio) != HPAGE_PMD_ORDER) return ret; page = &folio->page; /* * Just backoff if any subpage of a THP is corrupted otherwise * the corrupted page may mapped by PMD silently to escape the * check. This kind of THP just can be PTE mapped. Access to * the corrupted subpage should trigger SIGBUS as expected. */ if (unlikely(folio_test_has_hwpoisoned(folio))) return ret; /* * Archs like ppc64 need additional space to store information * related to pte entry. Use the preallocated table for that. */ if (arch_needs_pgtable_deposit() && !vmf->prealloc_pte) { vmf->prealloc_pte = pte_alloc_one(vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; } vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); if (unlikely(!pmd_none(*vmf->pmd))) goto out; flush_icache_pages(vma, page, HPAGE_PMD_NR); entry = folio_mk_pmd(folio, vma->vm_page_prot); if (write) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); add_mm_counter(vma->vm_mm, mm_counter_file(folio), HPAGE_PMD_NR); folio_add_file_rmap_pmd(folio, page, vma); /* * deposit and withdraw with pmd lock held */ if (arch_needs_pgtable_deposit()) deposit_prealloc_pte(vmf); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); update_mmu_cache_pmd(vma, haddr, vmf->pmd); /* fault is handled */ ret = 0; count_vm_event(THP_FILE_MAPPED); out: spin_unlock(vmf->ptl); return ret; } #else vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *page) { return VM_FAULT_FALLBACK; } #endif /** * set_pte_range - Set a range of PTEs to point to pages in a folio. * @vmf: Fault decription. * @folio: The folio that contains @page. * @page: The first page to create a PTE for. * @nr: The number of PTEs to create. * @addr: The first address to create a PTE for. */ void set_pte_range(struct vm_fault *vmf, struct folio *folio, struct page *page, unsigned int nr, unsigned long addr) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; bool prefault = !in_range(vmf->address, addr, nr * PAGE_SIZE); pte_t entry; flush_icache_pages(vma, page, nr); entry = mk_pte(page, vma->vm_page_prot); if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); else entry = pte_sw_mkyoung(entry); if (write) entry = maybe_mkwrite(pte_mkdirty(entry), vma); else if (pte_write(entry) && folio_test_dirty(folio)) entry = pte_mkdirty(entry); if (unlikely(vmf_orig_pte_uffd_wp(vmf))) entry = pte_mkuffd_wp(entry); /* copy-on-write page */ if (write && !(vma->vm_flags & VM_SHARED)) { VM_BUG_ON_FOLIO(nr != 1, folio); folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); } else { folio_add_file_rmap_ptes(folio, page, nr, vma); } set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr); /* no need to invalidate: a not-present page won't be cached */ update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr); } static bool vmf_pte_changed(struct vm_fault *vmf) { if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) return !pte_same(ptep_get(vmf->pte), vmf->orig_pte); return !pte_none(ptep_get(vmf->pte)); } /** * finish_fault - finish page fault once we have prepared the page to fault * * @vmf: structure describing the fault * * This function handles all that is needed to finish a page fault once the * page to fault in is prepared. It handles locking of PTEs, inserts PTE for * given page, adds reverse page mapping, handles memcg charges and LRU * addition. * * The function expects the page to be locked and on success it consumes a * reference of a page being mapped (for the PTE which maps it). * * Return: %0 on success, %VM_FAULT_ code in case of error. */ vm_fault_t finish_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct page *page; struct folio *folio; vm_fault_t ret; bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED); int type, nr_pages; unsigned long addr; bool needs_fallback = false; fallback: addr = vmf->address; /* Did we COW the page? */ if (is_cow) page = vmf->cow_page; else page = vmf->page; folio = page_folio(page); /* * check even for read faults because we might have lost our CoWed * page */ if (!(vma->vm_flags & VM_SHARED)) { ret = check_stable_address_space(vma->vm_mm); if (ret) return ret; } if (pmd_none(*vmf->pmd)) { if (folio_test_pmd_mappable(folio)) { ret = do_set_pmd(vmf, folio, page); if (ret != VM_FAULT_FALLBACK) return ret; } if (vmf->prealloc_pte) pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte); else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) return VM_FAULT_OOM; } nr_pages = folio_nr_pages(folio); /* Using per-page fault to maintain the uffd semantics */ if (unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) { nr_pages = 1; } else if (nr_pages > 1) { pgoff_t idx = folio_page_idx(folio, page); /* The page offset of vmf->address within the VMA. */ pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff; /* The index of the entry in the pagetable for fault page. */ pgoff_t pte_off = pte_index(vmf->address); /* * Fallback to per-page fault in case the folio size in page * cache beyond the VMA limits and PMD pagetable limits. */ if (unlikely(vma_off < idx || vma_off + (nr_pages - idx) > vma_pages(vma) || pte_off < idx || pte_off + (nr_pages - idx) > PTRS_PER_PTE)) { nr_pages = 1; } else { /* Now we can set mappings for the whole large folio. */ addr = vmf->address - idx * PAGE_SIZE; page = &folio->page; } } vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) return VM_FAULT_NOPAGE; /* Re-check under ptl */ if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) { update_mmu_tlb(vma, addr, vmf->pte); ret = VM_FAULT_NOPAGE; goto unlock; } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) { needs_fallback = true; pte_unmap_unlock(vmf->pte, vmf->ptl); goto fallback; } folio_ref_add(folio, nr_pages - 1); set_pte_range(vmf, folio, page, nr_pages, addr); type = is_cow ? MM_ANONPAGES : mm_counter_file(folio); add_mm_counter(vma->vm_mm, type, nr_pages); ret = 0; unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; } static unsigned long fault_around_pages __read_mostly = 65536 >> PAGE_SHIFT; #ifdef CONFIG_DEBUG_FS static int fault_around_bytes_get(void *data, u64 *val) { *val = fault_around_pages << PAGE_SHIFT; return 0; } /* * fault_around_bytes must be rounded down to the nearest page order as it's * what do_fault_around() expects to see. */ static int fault_around_bytes_set(void *data, u64 val) { if (val / PAGE_SIZE > PTRS_PER_PTE) return -EINVAL; /* * The minimum value is 1 page, however this results in no fault-around * at all. See should_fault_around(). */ val = max(val, PAGE_SIZE); fault_around_pages = rounddown_pow_of_two(val) >> PAGE_SHIFT; return 0; } DEFINE_DEBUGFS_ATTRIBUTE(fault_around_bytes_fops, fault_around_bytes_get, fault_around_bytes_set, "%llu\n"); static int __init fault_around_debugfs(void) { debugfs_create_file_unsafe("fault_around_bytes", 0644, NULL, NULL, &fault_around_bytes_fops); return 0; } late_initcall(fault_around_debugfs); #endif /* * do_fault_around() tries to map few pages around the fault address. The hope * is that the pages will be needed soon and this will lower the number of * faults to handle. * * It uses vm_ops->map_pages() to map the pages, which skips the page if it's * not ready to be mapped: not up-to-date, locked, etc. * * This function doesn't cross VMA or page table boundaries, in order to call * map_pages() and acquire a PTE lock only once. * * fault_around_pages defines how many pages we'll try to map. * do_fault_around() expects it to be set to a power of two less than or equal * to PTRS_PER_PTE. * * The virtual address of the area that we map is naturally aligned to * fault_around_pages * PAGE_SIZE rounded down to the machine page size * (and therefore to page order). This way it's easier to guarantee * that we don't cross page table boundaries. */ static vm_fault_t do_fault_around(struct vm_fault *vmf) { pgoff_t nr_pages = READ_ONCE(fault_around_pages); pgoff_t pte_off = pte_index(vmf->address); /* The page offset of vmf->address within the VMA. */ pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff; pgoff_t from_pte, to_pte; vm_fault_t ret; /* The PTE offset of the start address, clamped to the VMA. */ from_pte = max(ALIGN_DOWN(pte_off, nr_pages), pte_off - min(pte_off, vma_off)); /* The PTE offset of the end address, clamped to the VMA and PTE. */ to_pte = min3(from_pte + nr_pages, (pgoff_t)PTRS_PER_PTE, pte_off + vma_pages(vmf->vma) - vma_off) - 1; if (pmd_none(*vmf->pmd)) { vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; } rcu_read_lock(); ret = vmf->vma->vm_ops->map_pages(vmf, vmf->pgoff + from_pte - pte_off, vmf->pgoff + to_pte - pte_off); rcu_read_unlock(); return ret; } /* Return true if we should do read fault-around, false otherwise */ static inline bool should_fault_around(struct vm_fault *vmf) { /* No ->map_pages? No way to fault around... */ if (!vmf->vma->vm_ops->map_pages) return false; if (uffd_disable_fault_around(vmf->vma)) return false; /* A single page implies no faulting 'around' at all. */ return fault_around_pages > 1; } static vm_fault_t do_read_fault(struct vm_fault *vmf) { vm_fault_t ret = 0; struct folio *folio; /* * Let's call ->map_pages() first and use ->fault() as fallback * if page by the offset is not ready to be mapped (cold cache or * something). */ if (should_fault_around(vmf)) { ret = do_fault_around(vmf); if (ret) return ret; } ret = vmf_can_call_fault(vmf); if (ret) return ret; ret = __do_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) return ret; ret |= finish_fault(vmf); folio = page_folio(vmf->page); folio_unlock(folio); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) folio_put(folio); return ret; } static vm_fault_t do_cow_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct folio *folio; vm_fault_t ret; ret = vmf_can_call_fault(vmf); if (!ret) ret = vmf_anon_prepare(vmf); if (ret) return ret; folio = folio_prealloc(vma->vm_mm, vma, vmf->address, false); if (!folio) return VM_FAULT_OOM; vmf->cow_page = &folio->page; ret = __do_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) goto uncharge_out; if (ret & VM_FAULT_DONE_COW) return ret; if (copy_mc_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma)) { ret = VM_FAULT_HWPOISON; goto unlock; } __folio_mark_uptodate(folio); ret |= finish_fault(vmf); unlock: unlock_page(vmf->page); put_page(vmf->page); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) goto uncharge_out; return ret; uncharge_out: folio_put(folio); return ret; } static vm_fault_t do_shared_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; vm_fault_t ret, tmp; struct folio *folio; ret = vmf_can_call_fault(vmf); if (ret) return ret; ret = __do_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) return ret; folio = page_folio(vmf->page); /* * Check if the backing address space wants to know that the page is * about to become writable */ if (vma->vm_ops->page_mkwrite) { folio_unlock(folio); tmp = do_page_mkwrite(vmf, folio); if (unlikely(!tmp || (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) { folio_put(folio); return tmp; } } ret |= finish_fault(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) { folio_unlock(folio); folio_put(folio); return ret; } ret |= fault_dirty_shared_page(vmf); return ret; } /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults). * The mmap_lock may have been released depending on flags and our * return value. See filemap_fault() and __folio_lock_or_retry(). * If mmap_lock is released, vma may become invalid (for example * by other thread calling munmap()). */ static vm_fault_t do_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct mm_struct *vm_mm = vma->vm_mm; vm_fault_t ret; /* * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */ if (!vma->vm_ops->fault) { vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!vmf->pte)) ret = VM_FAULT_SIGBUS; else { /* * Make sure this is not a temporary clearing of pte * by holding ptl and checking again. A R/M/W update * of pte involves: take ptl, clearing the pte so that * we don't have concurrent modification by hardware * followed by an update. */ if (unlikely(pte_none(ptep_get(vmf->pte)))) ret = VM_FAULT_SIGBUS; else ret = VM_FAULT_NOPAGE; pte_unmap_unlock(vmf->pte, vmf->ptl); } } else if (!(vmf->flags & FAULT_FLAG_WRITE)) ret = do_read_fault(vmf); else if (!(vma->vm_flags & VM_SHARED)) ret = do_cow_fault(vmf); else ret = do_shared_fault(vmf); /* preallocated pagetable is unused: free it */ if (vmf->prealloc_pte) { pte_free(vm_mm, vmf->prealloc_pte); vmf->prealloc_pte = NULL; } return ret; } int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, unsigned long addr, int *flags, bool writable, int *last_cpupid) { struct vm_area_struct *vma = vmf->vma; /* * Avoid grouping on RO pages in general. RO pages shouldn't hurt as * much anyway since they can be in shared cache state. This misses * the case where a mapping is writable but the process never writes * to it but pte_write gets cleared during protection updates and * pte_dirty has unpredictable behaviour between PTE scan updates, * background writeback, dirty balancing and application behaviour. */ if (!writable) *flags |= TNF_NO_GROUP; /* * Flag if the folio is shared between multiple address spaces. This * is later used when determining whether to group tasks together */ if (folio_maybe_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) *flags |= TNF_SHARED; /* * For memory tiering mode, cpupid of slow memory page is used * to record page access time. So use default value. */ if (folio_use_access_time(folio)) *last_cpupid = (-1 & LAST_CPUPID_MASK); else *last_cpupid = folio_last_cpupid(folio); /* Record the current PID acceesing VMA */ vma_set_access_pid_bit(vma); count_vm_numa_event(NUMA_HINT_FAULTS); #ifdef CONFIG_NUMA_BALANCING count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1); #endif if (folio_nid(folio) == numa_node_id()) { count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); *flags |= TNF_FAULT_LOCAL; } return mpol_misplaced(folio, vmf, addr); } static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_struct *vma, unsigned long fault_addr, pte_t *fault_pte, bool writable) { pte_t pte, old_pte; old_pte = ptep_modify_prot_start(vma, fault_addr, fault_pte); pte = pte_modify(old_pte, vma->vm_page_prot); pte = pte_mkyoung(pte); if (writable) pte = pte_mkwrite(pte, vma); ptep_modify_prot_commit(vma, fault_addr, fault_pte, old_pte, pte); update_mmu_cache_range(vmf, vma, fault_addr, fault_pte, 1); } static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_struct *vma, struct folio *folio, pte_t fault_pte, bool ignore_writable, bool pte_write_upgrade) { int nr = pte_pfn(fault_pte) - folio_pfn(folio); unsigned long start, end, addr = vmf->address; unsigned long addr_start = addr - (nr << PAGE_SHIFT); unsigned long pt_start = ALIGN_DOWN(addr, PMD_SIZE); pte_t *start_ptep; /* Stay within the VMA and within the page table. */ start = max3(addr_start, pt_start, vma->vm_start); end = min3(addr_start + folio_size(folio), pt_start + PMD_SIZE, vma->vm_end); start_ptep = vmf->pte - ((addr - start) >> PAGE_SHIFT); /* Restore all PTEs' mapping of the large folio */ for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) { pte_t ptent = ptep_get(start_ptep); bool writable = false; if (!pte_present(ptent) || !pte_protnone(ptent)) continue; if (pfn_folio(pte_pfn(ptent)) != folio) continue; if (!ignore_writable) { ptent = pte_modify(ptent, vma->vm_page_prot); writable = pte_write(ptent); if (!writable && pte_write_upgrade && can_change_pte_writable(vma, addr, ptent)) writable = true; } numa_rebuild_single_mapping(vmf, vma, addr, start_ptep, writable); } } static vm_fault_t do_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct folio *folio = NULL; int nid = NUMA_NO_NODE; bool writable = false, ignore_writable = false; bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma); int last_cpupid; int target_nid; pte_t pte, old_pte; int flags = 0, nr_pages; /* * The pte cannot be used safely until we verify, while holding the page * table lock, that its contents have not changed during fault handling. */ spin_lock(vmf->ptl); /* Read the live PTE from the page tables: */ old_pte = ptep_get(vmf->pte); if (unlikely(!pte_same(old_pte, vmf->orig_pte))) { pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } pte = pte_modify(old_pte, vma->vm_page_prot); /* * Detect now whether the PTE could be writable; this information * is only valid while holding the PT lock. */ writable = pte_write(pte); if (!writable && pte_write_upgrade && can_change_pte_writable(vma, vmf->address, pte)) writable = true; folio = vm_normal_folio(vma, vmf->address, pte); if (!folio || folio_is_zone_device(folio)) goto out_map; nid = folio_nid(folio); nr_pages = folio_nr_pages(folio); target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags, writable, &last_cpupid); if (target_nid == NUMA_NO_NODE) goto out_map; if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { flags |= TNF_MIGRATE_FAIL; goto out_map; } /* The folio is isolated and isolation code holds a folio reference. */ pte_unmap_unlock(vmf->pte, vmf->ptl); writable = false; ignore_writable = true; /* Migrate to the requested node */ if (!migrate_misplaced_folio(folio, target_nid)) { nid = target_nid; flags |= TNF_MIGRATED; task_numa_fault(last_cpupid, nid, nr_pages, flags); return 0; } flags |= TNF_MIGRATE_FAIL; vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!vmf->pte)) return 0; if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } out_map: /* * Make it present again, depending on how arch implements * non-accessible ptes, some can allow access by kernel mode. */ if (folio && folio_test_large(folio)) numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable, pte_write_upgrade); else numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte, writable); pte_unmap_unlock(vmf->pte, vmf->ptl); if (nid != NUMA_NO_NODE) task_numa_fault(last_cpupid, nid, nr_pages, flags); return 0; } static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; if (vma_is_anonymous(vma)) return do_huge_pmd_anonymous_page(vmf); if (vma->vm_ops->huge_fault) return vma->vm_ops->huge_fault(vmf, PMD_ORDER); return VM_FAULT_FALLBACK; } /* `inline' is required to avoid gcc 4.1.2 build error */ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; vm_fault_t ret; if (vma_is_anonymous(vma)) { if (likely(!unshare) && userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd)) { if (userfaultfd_wp_async(vmf->vma)) goto split; return handle_userfault(vmf, VM_UFFD_WP); } return do_huge_pmd_wp_page(vmf); } if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { if (vma->vm_ops->huge_fault) { ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER); if (!(ret & VM_FAULT_FALLBACK)) return ret; } } split: /* COW or write-notify handled on pte level: split pmd. */ __split_huge_pmd(vma, vmf->pmd, vmf->address, false); return VM_FAULT_FALLBACK; } static vm_fault_t create_huge_pud(struct vm_fault *vmf) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) struct vm_area_struct *vma = vmf->vma; /* No support for anonymous transparent PUD pages yet */ if (vma_is_anonymous(vma)) return VM_FAULT_FALLBACK; if (vma->vm_ops->huge_fault) return vma->vm_ops->huge_fault(vmf, PUD_ORDER); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ return VM_FAULT_FALLBACK; } static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) struct vm_area_struct *vma = vmf->vma; vm_fault_t ret; /* No support for anonymous transparent PUD pages yet */ if (vma_is_anonymous(vma)) goto split; if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { if (vma->vm_ops->huge_fault) { ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER); if (!(ret & VM_FAULT_FALLBACK)) return ret; } } split: /* COW or write-notify not handled on PUD level: split pud.*/ __split_huge_pud(vma, vmf->pud, vmf->address); #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ return VM_FAULT_FALLBACK; } /* * These routines also need to handle stuff like marking pages dirty * and/or accessed for architectures that don't do it in hardware (most * RISC architectures). The early dirtying is also good on the i386. * * There is also a hook called "update_mmu_cache()" that architectures * with external mmu caches can use to update those (ie the Sparc or * PowerPC hashed page tables that act as extended TLBs). * * We enter with non-exclusive mmap_lock (to exclude vma changes, but allow * concurrent faults). * * The mmap_lock may have been released depending on flags and our return value. * See filemap_fault() and __folio_lock_or_retry(). */ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) { pte_t entry; if (unlikely(pmd_none(*vmf->pmd))) { /* * Leave __pte_alloc() until later: because vm_ops->fault may * want to allocate huge page, and if we expose page table * for an instant, it will be difficult to retract from * concurrent faults and from rmap lookups. */ vmf->pte = NULL; vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID; } else { pmd_t dummy_pmdval; /* * A regular pmd is established and it can't morph into a huge * pmd by anon khugepaged, since that takes mmap_lock in write * mode; but shmem or file collapse to THP could still morph * it into a huge pmd: just retry later if so. * * Use the maywrite version to indicate that vmf->pte may be * modified, but since we will use pte_same() to detect the * change of the !pte_none() entry, there is no need to recheck * the pmdval. Here we chooes to pass a dummy variable instead * of NULL, which helps new user think about why this place is * special. */ vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &dummy_pmdval, &vmf->ptl); if (unlikely(!vmf->pte)) return 0; vmf->orig_pte = ptep_get_lockless(vmf->pte); vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; if (pte_none(vmf->orig_pte)) { pte_unmap(vmf->pte); vmf->pte = NULL; } } if (!vmf->pte) return do_pte_missing(vmf); if (!pte_present(vmf->orig_pte)) return do_swap_page(vmf); if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) return do_numa_page(vmf); spin_lock(vmf->ptl); entry = vmf->orig_pte; if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); goto unlock; } if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { if (!pte_write(entry)) return do_wp_page(vmf); else if (likely(vmf->flags & FAULT_FLAG_WRITE)) entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry, vmf->flags & FAULT_FLAG_WRITE)) { update_mmu_cache_range(vmf, vmf->vma, vmf->address, vmf->pte, 1); } else { /* Skip spurious TLB flush for retried page fault */ if (vmf->flags & FAULT_FLAG_TRIED) goto unlock; /* * This is needed only for protection faults but the arch code * is not yet telling us if this is a protection fault or not. * This still avoids useless tlb flushes for .text page faults * with threads. */ if (vmf->flags & FAULT_FLAG_WRITE) flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, vmf->pte); } unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } /* * On entry, we hold either the VMA lock or the mmap_lock * (FAULT_FLAG_VMA_LOCK tells you which). If VM_FAULT_RETRY is set in * the result, the mmap_lock is not held on exit. See filemap_fault() * and __folio_lock_or_retry(). */ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags) { struct vm_fault vmf = { .vma = vma, .address = address & PAGE_MASK, .real_address = address, .flags = flags, .pgoff = linear_page_index(vma, address), .gfp_mask = __get_fault_gfp_mask(vma), }; struct mm_struct *mm = vma->vm_mm; vm_flags_t vm_flags = vma->vm_flags; pgd_t *pgd; p4d_t *p4d; vm_fault_t ret; pgd = pgd_offset(mm, address); p4d = p4d_alloc(mm, pgd, address); if (!p4d) return VM_FAULT_OOM; vmf.pud = pud_alloc(mm, p4d, address); if (!vmf.pud) return VM_FAULT_OOM; retry_pud: if (pud_none(*vmf.pud) && thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { pud_t orig_pud = *vmf.pud; barrier(); if (pud_trans_huge(orig_pud)) { /* * TODO once we support anonymous PUDs: NUMA case and * FAULT_FLAG_UNSHARE handling. */ if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) { ret = wp_huge_pud(&vmf, orig_pud); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { huge_pud_set_accessed(&vmf, orig_pud); return 0; } } } vmf.pmd = pmd_alloc(mm, vmf.pud, address); if (!vmf.pmd) return VM_FAULT_OOM; /* Huge pud page fault raced with pmd_alloc? */ if (pud_trans_unstable(vmf.pud)) goto retry_pud; if (pmd_none(*vmf.pmd) && thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { vmf.orig_pmd = pmdp_get_lockless(vmf.pmd); if (unlikely(is_swap_pmd(vmf.orig_pmd))) { VM_BUG_ON(thp_migration_supported() && !is_pmd_migration_entry(vmf.orig_pmd)); if (is_pmd_migration_entry(vmf.orig_pmd)) pmd_migration_entry_wait(mm, vmf.pmd); return 0; } if (pmd_trans_huge(vmf.orig_pmd)) { if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf); if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && !pmd_write(vmf.orig_pmd)) { ret = wp_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { huge_pmd_set_accessed(&vmf); return 0; } } } return handle_pte_fault(&vmf); } /** * mm_account_fault - Do page fault accounting * @mm: mm from which memcg should be extracted. It can be NULL. * @regs: the pt_regs struct pointer. When set to NULL, will skip accounting * of perf event counters, but we'll still do the per-task accounting to * the task who triggered this page fault. * @address: the faulted address. * @flags: the fault flags. * @ret: the fault retcode. * * This will take care of most of the page fault accounting. Meanwhile, it * will also include the PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN] perf counter * updates. However, note that the handling of PERF_COUNT_SW_PAGE_FAULTS should * still be in per-arch page fault handlers at the entry of page fault. */ static inline void mm_account_fault(struct mm_struct *mm, struct pt_regs *regs, unsigned long address, unsigned int flags, vm_fault_t ret) { bool major; /* Incomplete faults will be accounted upon completion. */ if (ret & VM_FAULT_RETRY) return; /* * To preserve the behavior of older kernels, PGFAULT counters record * both successful and failed faults, as opposed to perf counters, * which ignore failed cases. */ count_vm_event(PGFAULT); count_memcg_event_mm(mm, PGFAULT); /* * Do not account for unsuccessful faults (e.g. when the address wasn't * valid). That includes arch_vma_access_permitted() failing before * reaching here. So this is not a "this many hardware page faults" * counter. We should use the hw profiling for that. */ if (ret & VM_FAULT_ERROR) return; /* * We define the fault as a major fault when the final successful fault * is VM_FAULT_MAJOR, or if it retried (which implies that we couldn't * handle it immediately previously). */ major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED); if (major) current->maj_flt++; else current->min_flt++; /* * If the fault is done for GUP, regs will be NULL. We only do the * accounting for the per thread fault counters who triggered the * fault, and we skip the perf event updates. */ if (!regs) return; if (major) perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); else perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); } #ifdef CONFIG_LRU_GEN static void lru_gen_enter_fault(struct vm_area_struct *vma) { /* the LRU algorithm only applies to accesses with recency */ current->in_lru_fault = vma_has_recency(vma); } static void lru_gen_exit_fault(void) { current->in_lru_fault = false; } #else static void lru_gen_enter_fault(struct vm_area_struct *vma) { } static void lru_gen_exit_fault(void) { } #endif /* CONFIG_LRU_GEN */ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma, unsigned int *flags) { if (unlikely(*flags & FAULT_FLAG_UNSHARE)) { if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE)) return VM_FAULT_SIGSEGV; /* * FAULT_FLAG_UNSHARE only applies to COW mappings. Let's * just treat it like an ordinary read-fault otherwise. */ if (!is_cow_mapping(vma->vm_flags)) *flags &= ~FAULT_FLAG_UNSHARE; } else if (*flags & FAULT_FLAG_WRITE) { /* Write faults on read-only mappings are impossible ... */ if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE))) return VM_FAULT_SIGSEGV; /* ... and FOLL_FORCE only applies to COW mappings. */ if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) && !is_cow_mapping(vma->vm_flags))) return VM_FAULT_SIGSEGV; } #ifdef CONFIG_PER_VMA_LOCK /* * Per-VMA locks can't be used with FAULT_FLAG_RETRY_NOWAIT because of * the assumption that lock is dropped on VM_FAULT_RETRY. */ if (WARN_ON_ONCE((*flags & (FAULT_FLAG_VMA_LOCK | FAULT_FLAG_RETRY_NOWAIT)) == (FAULT_FLAG_VMA_LOCK | FAULT_FLAG_RETRY_NOWAIT))) return VM_FAULT_SIGSEGV; #endif return 0; } /* * By the time we get here, we already hold either the VMA lock or the * mmap_lock (FAULT_FLAG_VMA_LOCK tells you which). * * The mmap_lock may have been released depending on flags and our * return value. See filemap_fault() and __folio_lock_or_retry(). */ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, struct pt_regs *regs) { /* If the fault handler drops the mmap_lock, vma may be freed */ struct mm_struct *mm = vma->vm_mm; vm_fault_t ret; bool is_droppable; __set_current_state(TASK_RUNNING); ret = sanitize_fault_flags(vma, &flags); if (ret) goto out; if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, flags & FAULT_FLAG_INSTRUCTIO |