| 69 69 221 220 170 170 170 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 | #include <linux/gfp.h> #include <linux/initrd.h> #include <linux/ioport.h> #include <linux/swap.h> #include <linux/memblock.h> #include <linux/swapfile.h> #include <linux/swapops.h> #include <linux/kmemleak.h> #include <linux/sched/task.h> #include <linux/execmem.h> #include <asm/set_memory.h> #include <asm/cpu_device_id.h> #include <asm/e820/api.h> #include <asm/init.h> #include <asm/page.h> #include <asm/page_types.h> #include <asm/sections.h> #include <asm/setup.h> #include <asm/tlbflush.h> #include <asm/tlb.h> #include <asm/proto.h> #include <asm/dma.h> /* for MAX_DMA_PFN */ #include <asm/kaslr.h> #include <asm/hypervisor.h> #include <asm/cpufeature.h> #include <asm/pti.h> #include <asm/text-patching.h> #include <asm/memtype.h> #include <asm/paravirt.h> #include <asm/mmu_context.h> /* * We need to define the tracepoints somewhere, and tlb.c * is only compiled when SMP=y. */ #include <trace/events/tlb.h> #include "mm_internal.h" /* * Tables translating between page_cache_type_t and pte encoding. * * The default values are defined statically as minimal supported mode; * WC and WT fall back to UC-. pat_init() updates these values to support * more cache modes, WC and WT, when it is safe to do so. See pat_init() * for the details. Note, __early_ioremap() used during early boot-time * takes pgprot_t (pte encoding) and does not use these tables. * * Index into __cachemode2pte_tbl[] is the cachemode. * * Index into __pte2cachemode_tbl[] are the caching attribute bits of the pte * (_PAGE_PWT, _PAGE_PCD, _PAGE_PAT) at index bit positions 0, 1, 2. */ static uint16_t __cachemode2pte_tbl[_PAGE_CACHE_MODE_NUM] = { [_PAGE_CACHE_MODE_WB ] = 0 | 0 , [_PAGE_CACHE_MODE_WC ] = 0 | _PAGE_PCD, [_PAGE_CACHE_MODE_UC_MINUS] = 0 | _PAGE_PCD, [_PAGE_CACHE_MODE_UC ] = _PAGE_PWT | _PAGE_PCD, [_PAGE_CACHE_MODE_WT ] = 0 | _PAGE_PCD, [_PAGE_CACHE_MODE_WP ] = 0 | _PAGE_PCD, }; unsigned long cachemode2protval(enum page_cache_mode pcm) { if (likely(pcm == 0)) return 0; return __cachemode2pte_tbl[pcm]; } EXPORT_SYMBOL(cachemode2protval); static uint8_t __pte2cachemode_tbl[8] = { [__pte2cm_idx( 0 | 0 | 0 )] = _PAGE_CACHE_MODE_WB, [__pte2cm_idx(_PAGE_PWT | 0 | 0 )] = _PAGE_CACHE_MODE_UC_MINUS, [__pte2cm_idx( 0 | _PAGE_PCD | 0 )] = _PAGE_CACHE_MODE_UC_MINUS, [__pte2cm_idx(_PAGE_PWT | _PAGE_PCD | 0 )] = _PAGE_CACHE_MODE_UC, [__pte2cm_idx( 0 | 0 | _PAGE_PAT)] = _PAGE_CACHE_MODE_WB, [__pte2cm_idx(_PAGE_PWT | 0 | _PAGE_PAT)] = _PAGE_CACHE_MODE_UC_MINUS, [__pte2cm_idx(0 | _PAGE_PCD | _PAGE_PAT)] = _PAGE_CACHE_MODE_UC_MINUS, [__pte2cm_idx(_PAGE_PWT | _PAGE_PCD | _PAGE_PAT)] = _PAGE_CACHE_MODE_UC, }; /* * Check that the write-protect PAT entry is set for write-protect. * To do this without making assumptions how PAT has been set up (Xen has * another layout than the kernel), translate the _PAGE_CACHE_MODE_WP cache * mode via the __cachemode2pte_tbl[] into protection bits (those protection * bits will select a cache mode of WP or better), and then translate the * protection bits back into the cache mode using __pte2cm_idx() and the * __pte2cachemode_tbl[] array. This will return the really used cache mode. */ bool x86_has_pat_wp(void) { uint16_t prot = __cachemode2pte_tbl[_PAGE_CACHE_MODE_WP]; return __pte2cachemode_tbl[__pte2cm_idx(prot)] == _PAGE_CACHE_MODE_WP; } enum page_cache_mode pgprot2cachemode(pgprot_t pgprot) { unsigned long masked; masked = pgprot_val(pgprot) & _PAGE_CACHE_MASK; if (likely(masked == 0)) return 0; return __pte2cachemode_tbl[__pte2cm_idx(masked)]; } static unsigned long __initdata pgt_buf_start; static unsigned long __initdata pgt_buf_end; static unsigned long __initdata pgt_buf_top; static unsigned long min_pfn_mapped; static bool __initdata can_use_brk_pgt = true; /* * Pages returned are already directly mapped. * * Changing that is likely to break Xen, see commit: * * 279b706 x86,xen: introduce x86_init.mapping.pagetable_reserve * * for detailed information. */ __ref void *alloc_low_pages(unsigned int num) { unsigned long pfn; int i; if (after_bootmem) { unsigned int order; order = get_order((unsigned long)num << PAGE_SHIFT); return (void *)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order); } if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) { unsigned long ret = 0; if (min_pfn_mapped < max_pfn_mapped) { ret = memblock_phys_alloc_range( PAGE_SIZE * num, PAGE_SIZE, min_pfn_mapped << PAGE_SHIFT, max_pfn_mapped << PAGE_SHIFT); } if (!ret && can_use_brk_pgt) ret = __pa(extend_brk(PAGE_SIZE * num, PAGE_SIZE)); if (!ret) panic("alloc_low_pages: can not alloc memory"); pfn = ret >> PAGE_SHIFT; } else { pfn = pgt_buf_end; pgt_buf_end += num; } for (i = 0; i < num; i++) { void *adr; adr = __va((pfn + i) << PAGE_SHIFT); clear_page(adr); } return __va(pfn << PAGE_SHIFT); } /* * By default need to be able to allocate page tables below PGD firstly for * the 0-ISA_END_ADDRESS range and secondly for the initial PMD_SIZE mapping. * With KASLR memory randomization, depending on the machine e820 memory and the * PUD alignment, twice that many pages may be needed when KASLR memory * randomization is enabled. */ #define INIT_PGD_PAGE_TABLES 4 #ifndef CONFIG_RANDOMIZE_MEMORY #define INIT_PGD_PAGE_COUNT (2 * INIT_PGD_PAGE_TABLES) #else #define INIT_PGD_PAGE_COUNT (4 * INIT_PGD_PAGE_TABLES) #endif #define INIT_PGT_BUF_SIZE (INIT_PGD_PAGE_COUNT * PAGE_SIZE) RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE); void __init early_alloc_pgt_buf(void) { unsigned long tables = INIT_PGT_BUF_SIZE; phys_addr_t base; base = __pa(extend_brk(tables, PAGE_SIZE)); pgt_buf_start = base >> PAGE_SHIFT; pgt_buf_end = pgt_buf_start; pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT); } int after_bootmem; early_param_on_off("gbpages", "nogbpages", direct_gbpages, CONFIG_X86_DIRECT_GBPAGES); struct map_range { unsigned long start; unsigned long end; unsigned page_size_mask; }; static int page_size_mask; /* * Save some of cr4 feature set we're using (e.g. Pentium 4MB * enable and PPro Global page enable), so that any CPU's that boot * up after us can get the correct flags. Invoked on the boot CPU. */ static inline void cr4_set_bits_and_update_boot(unsigned long mask) { mmu_cr4_features |= mask; if (trampoline_cr4_features) *trampoline_cr4_features = mmu_cr4_features; cr4_set_bits(mask); } static void __init probe_page_size_mask(void) { /* * For pagealloc debugging, identity mapping will use small pages. * This will simplify cpa(), which otherwise needs to support splitting * large pages into small in interrupt context, etc. */ if (boot_cpu_has(X86_FEATURE_PSE) && !debug_pagealloc_enabled()) page_size_mask |= 1 << PG_LEVEL_2M; else direct_gbpages = 0; /* Enable PSE if available */ if (boot_cpu_has(X86_FEATURE_PSE)) cr4_set_bits_and_update_boot(X86_CR4_PSE); /* Enable PGE if available */ __supported_pte_mask &= ~_PAGE_GLOBAL; if (boot_cpu_has(X86_FEATURE_PGE)) { cr4_set_bits_and_update_boot(X86_CR4_PGE); __supported_pte_mask |= _PAGE_GLOBAL; } /* By the default is everything supported: */ __default_kernel_pte_mask = __supported_pte_mask; /* Except when with PTI where the kernel is mostly non-Global: */ if (cpu_feature_enabled(X86_FEATURE_PTI)) __default_kernel_pte_mask &= ~_PAGE_GLOBAL; /* Enable 1 GB linear kernel mappings if available: */ if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) { printk(KERN_INFO "Using GB pages for direct mapping\n"); page_size_mask |= 1 << PG_LEVEL_1G; } else { direct_gbpages = 0; } } /* * INVLPG may not properly flush Global entries on * these CPUs. New microcode fixes the issue. */ static const struct x86_cpu_id invlpg_miss_ids[] = { X86_MATCH_VFM(INTEL_ALDERLAKE, 0x2e), X86_MATCH_VFM(INTEL_ALDERLAKE_L, 0x42c), X86_MATCH_VFM(INTEL_ATOM_GRACEMONT, 0x11), X86_MATCH_VFM(INTEL_RAPTORLAKE, 0x118), X86_MATCH_VFM(INTEL_RAPTORLAKE_P, 0x4117), X86_MATCH_VFM(INTEL_RAPTORLAKE_S, 0x2e), {} }; static void setup_pcid(void) { const struct x86_cpu_id *invlpg_miss_match; if (!IS_ENABLED(CONFIG_X86_64)) return; if (!boot_cpu_has(X86_FEATURE_PCID)) return; invlpg_miss_match = x86_match_cpu(invlpg_miss_ids); if (invlpg_miss_match && boot_cpu_data.microcode < invlpg_miss_match->driver_data) { pr_info("Incomplete global flushes, disabling PCID"); setup_clear_cpu_cap(X86_FEATURE_PCID); return; } if (boot_cpu_has(X86_FEATURE_PGE)) { /* * This can't be cr4_set_bits_and_update_boot() -- the * trampoline code can't handle CR4.PCIDE and it wouldn't * do any good anyway. Despite the name, * cr4_set_bits_and_update_boot() doesn't actually cause * the bits in question to remain set all the way through * the secondary boot asm. * * Instead, we brute-force it and set CR4.PCIDE manually in * start_secondary(). */ cr4_set_bits(X86_CR4_PCIDE); } else { /* * flush_tlb_all(), as currently implemented, won't work if * PCID is on but PGE is not. Since that combination * doesn't exist on real hardware, there's no reason to try * to fully support it, but it's polite to avoid corrupting * data if we're on an improperly configured VM. */ setup_clear_cpu_cap(X86_FEATURE_PCID); } } #ifdef CONFIG_X86_32 #define NR_RANGE_MR 3 #else /* CONFIG_X86_64 */ #define NR_RANGE_MR 5 #endif static int __meminit save_mr(struct map_range *mr, int nr_range, unsigned long start_pfn, unsigned long end_pfn, unsigned long page_size_mask) { if (start_pfn < end_pfn) { if (nr_range >= NR_RANGE_MR) panic("run out of range for init_memory_mapping\n"); mr[nr_range].start = start_pfn<<PAGE_SHIFT; mr[nr_range].end = end_pfn<<PAGE_SHIFT; mr[nr_range].page_size_mask = page_size_mask; nr_range++; } return nr_range; } /* * adjust the page_size_mask for small range to go with * big page size instead small one if nearby are ram too. */ static void __ref adjust_range_page_size_mask(struct map_range *mr, int nr_range) { int i; for (i = 0; i < nr_range; i++) { if ((page_size_mask & (1<<PG_LEVEL_2M)) && !(mr[i].page_size_mask & (1<<PG_LEVEL_2M))) { unsigned long start = round_down(mr[i].start, PMD_SIZE); unsigned long end = round_up(mr[i].end, PMD_SIZE); #ifdef CONFIG_X86_32 if ((end >> PAGE_SHIFT) > max_low_pfn) continue; #endif if (memblock_is_region_memory(start, end - start)) mr[i].page_size_mask |= 1<<PG_LEVEL_2M; } if ((page_size_mask & (1<<PG_LEVEL_1G)) && !(mr[i].page_size_mask & (1<<PG_LEVEL_1G))) { unsigned long start = round_down(mr[i].start, PUD_SIZE); unsigned long end = round_up(mr[i].end, PUD_SIZE); if (memblock_is_region_memory(start, end - start)) mr[i].page_size_mask |= 1<<PG_LEVEL_1G; } } } static const char *page_size_string(struct map_range *mr) { static const char str_1g[] = "1G"; static const char str_2m[] = "2M"; static const char str_4m[] = "4M"; static const char str_4k[] = "4k"; if (mr->page_size_mask & (1<<PG_LEVEL_1G)) return str_1g; /* * 32-bit without PAE has a 4M large page size. * PG_LEVEL_2M is misnamed, but we can at least * print out the right size in the string. */ if (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_PAE) && mr->page_size_mask & (1<<PG_LEVEL_2M)) return str_4m; if (mr->page_size_mask & (1<<PG_LEVEL_2M)) return str_2m; return str_4k; } static int __meminit split_mem_range(struct map_range *mr, int nr_range, unsigned long start, unsigned long end) { unsigned long start_pfn, end_pfn, limit_pfn; unsigned long pfn; int i; limit_pfn = PFN_DOWN(end); /* head if not big page alignment ? */ pfn = start_pfn = PFN_DOWN(start); #ifdef CONFIG_X86_32 /* * Don't use a large page for the first 2/4MB of memory * because there are often fixed size MTRRs in there * and overlapping MTRRs into large pages can cause * slowdowns. */ if (pfn == 0) end_pfn = PFN_DOWN(PMD_SIZE); else end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE)); #else /* CONFIG_X86_64 */ end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE)); #endif if (end_pfn > limit_pfn) end_pfn = limit_pfn; if (start_pfn < end_pfn) { nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0); pfn = end_pfn; } /* big page (2M) range */ start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE)); #ifdef CONFIG_X86_32 end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE)); #else /* CONFIG_X86_64 */ end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE)); if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE))) end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE)); #endif if (start_pfn < end_pfn) { nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, page_size_mask & (1<<PG_LEVEL_2M)); pfn = end_pfn; } #ifdef CONFIG_X86_64 /* big page (1G) range */ start_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE)); end_pfn = round_down(limit_pfn, PFN_DOWN(PUD_SIZE)); if (start_pfn < end_pfn) { nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, page_size_mask & ((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G))); pfn = end_pfn; } /* tail is not big page (1G) alignment */ start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE)); end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE)); if (start_pfn < end_pfn) { nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, page_size_mask & (1<<PG_LEVEL_2M)); pfn = end_pfn; } #endif /* tail is not big page (2M) alignment */ start_pfn = pfn; end_pfn = limit_pfn; nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0); if (!after_bootmem) adjust_range_page_size_mask(mr, nr_range); /* try to merge same page size and continuous */ for (i = 0; nr_range > 1 && i < nr_range - 1; i++) { unsigned long old_start; if (mr[i].end != mr[i+1].start || mr[i].page_size_mask != mr[i+1].page_size_mask) continue; /* move it */ old_start = mr[i].start; memmove(&mr[i], &mr[i+1], (nr_range - 1 - i) * sizeof(struct map_range)); mr[i--].start = old_start; nr_range--; } for (i = 0; i < nr_range; i++) pr_debug(" [mem %#010lx-%#010lx] page %s\n", mr[i].start, mr[i].end - 1, page_size_string(&mr[i])); return nr_range; } struct range pfn_mapped[E820_MAX_ENTRIES]; int nr_pfn_mapped; static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn) { nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_MAX_ENTRIES, nr_pfn_mapped, start_pfn, end_pfn); nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_MAX_ENTRIES); max_pfn_mapped = max(max_pfn_mapped, end_pfn); if (start_pfn < (1UL<<(32-PAGE_SHIFT))) max_low_pfn_mapped = max(max_low_pfn_mapped, min(end_pfn, 1UL<<(32-PAGE_SHIFT))); } bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn) { int i; for (i = 0; i < nr_pfn_mapped; i++) if ((start_pfn >= pfn_mapped[i].start) && (end_pfn <= pfn_mapped[i].end)) return true; return false; } /* * Setup the direct mapping of the physical memory at PAGE_OFFSET. * This runs before bootmem is initialized and gets pages directly from * the physical memory. To access them they are temporarily mapped. */ unsigned long __ref init_memory_mapping(unsigned long start, unsigned long end, pgprot_t prot) { struct map_range mr[NR_RANGE_MR]; unsigned long ret = 0; int nr_range, i; pr_debug("init_memory_mapping: [mem %#010lx-%#010lx]\n", start, end - 1); memset(mr, 0, sizeof(mr)); nr_range = split_mem_range(mr, 0, start, end); for (i = 0; i < nr_range; i++) ret = kernel_physical_mapping_init(mr[i].start, mr[i].end, mr[i].page_size_mask, prot); add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT); return ret >> PAGE_SHIFT; } /* * We need to iterate through the E820 memory map and create direct mappings * for only E820_TYPE_RAM and E820_KERN_RESERVED regions. We cannot simply * create direct mappings for all pfns from [0 to max_low_pfn) and * [4GB to max_pfn) because of possible memory holes in high addresses * that cannot be marked as UC by fixed/variable range MTRRs. * Depending on the alignment of E820 ranges, this may possibly result * in using smaller size (i.e. 4K instead of 2M or 1G) page tables. * * init_mem_mapping() calls init_range_memory_mapping() with big range. * That range would have hole in the middle or ends, and only ram parts * will be mapped in init_range_memory_mapping(). */ static unsigned long __init init_range_memory_mapping( unsigned long r_start, unsigned long r_end) { unsigned long start_pfn, end_pfn; unsigned long mapped_ram_size = 0; int i; for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { u64 start = clamp_val(PFN_PHYS(start_pfn), r_start, r_end); u64 end = clamp_val(PFN_PHYS(end_pfn), r_start, r_end); if (start >= end) continue; /* * if it is overlapping with brk pgt, we need to * alloc pgt buf from memblock instead. */ can_use_brk_pgt = max(start, (u64)pgt_buf_end<<PAGE_SHIFT) >= min(end, (u64)pgt_buf_top<<PAGE_SHIFT); init_memory_mapping(start, end, PAGE_KERNEL); mapped_ram_size += end - start; can_use_brk_pgt = true; } return mapped_ram_size; } static unsigned long __init get_new_step_size(unsigned long step_size) { /* * Initial mapped size is PMD_SIZE (2M). * We can not set step_size to be PUD_SIZE (1G) yet. * In worse case, when we cross the 1G boundary, and * PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k) * to map 1G range with PTE. Hence we use one less than the * difference of page table level shifts. * * Don't need to worry about overflow in the top-down case, on 32bit, * when step_size is 0, round_down() returns 0 for start, and that * turns it into 0x100000000ULL. * In the bottom-up case, round_up(x, 0) returns 0 though too, which * needs to be taken into consideration by the code below. */ return step_size << (PMD_SHIFT - PAGE_SHIFT - 1); } /** * memory_map_top_down - Map [map_start, map_end) top down * @map_start: start address of the target memory range * @map_end: end address of the target memory range * * This function will setup direct mapping for memory range * [map_start, map_end) in top-down. That said, the page tables * will be allocated at the end of the memory, and we map the * memory in top-down. */ static void __init memory_map_top_down(unsigned long map_start, unsigned long map_end) { unsigned long real_end, last_start; unsigned long step_size; unsigned long addr; unsigned long mapped_ram_size = 0; /* * Systems that have many reserved areas near top of the memory, * e.g. QEMU with less than 1G RAM and EFI enabled, or Xen, will * require lots of 4K mappings which may exhaust pgt_buf. * Start with top-most PMD_SIZE range aligned at PMD_SIZE to ensure * there is enough mapped memory that can be allocated from * memblock. */ addr = memblock_phys_alloc_range(PMD_SIZE, PMD_SIZE, map_start, map_end); if (!addr) { pr_warn("Failed to release memory for alloc_low_pages()"); real_end = max(map_start, ALIGN_DOWN(map_end, PMD_SIZE)); } else { memblock_phys_free(addr, PMD_SIZE); real_end = addr + PMD_SIZE; } /* step_size need to be small so pgt_buf from BRK could cover it */ step_size = PMD_SIZE; max_pfn_mapped = 0; /* will get exact value next */ min_pfn_mapped = real_end >> PAGE_SHIFT; last_start = real_end; /* * We start from the top (end of memory) and go to the bottom. * The memblock_find_in_range() gets us a block of RAM from the * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages * for page table. */ while (last_start > map_start) { unsigned long start; if (last_start > step_size) { start = round_down(last_start - 1, step_size); if (start < map_start) start = map_start; } else start = map_start; mapped_ram_size += init_range_memory_mapping(start, last_start); last_start = start; min_pfn_mapped = last_start >> PAGE_SHIFT; if (mapped_ram_size >= step_size) step_size = get_new_step_size(step_size); } if (real_end < map_end) init_range_memory_mapping(real_end, map_end); } /** * memory_map_bottom_up - Map [map_start, map_end) bottom up * @map_start: start address of the target memory range * @map_end: end address of the target memory range * * This function will setup direct mapping for memory range * [map_start, map_end) in bottom-up. Since we have limited the * bottom-up allocation above the kernel, the page tables will * be allocated just above the kernel and we map the memory * in [map_start, map_end) in bottom-up. */ static void __init memory_map_bottom_up(unsigned long map_start, unsigned long map_end) { unsigned long next, start; unsigned long mapped_ram_size = 0; /* step_size need to be small so pgt_buf from BRK could cover it */ unsigned long step_size = PMD_SIZE; start = map_start; min_pfn_mapped = start >> PAGE_SHIFT; /* * We start from the bottom (@map_start) and go to the top (@map_end). * The memblock_find_in_range() gets us a block of RAM from the * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages * for page table. */ while (start < map_end) { if (step_size && map_end - start > step_size) { next = round_up(start + 1, step_size); if (next > map_end) next = map_end; } else { next = map_end; } mapped_ram_size += init_range_memory_mapping(start, next); start = next; if (mapped_ram_size >= step_size) step_size = get_new_step_size(step_size); } } /* * The real mode trampoline, which is required for bootstrapping CPUs * occupies only a small area under the low 1MB. See reserve_real_mode() * for details. * * If KASLR is disabled the first PGD entry of the direct mapping is copied * to map the real mode trampoline. * * If KASLR is enabled, copy only the PUD which covers the low 1MB * area. This limits the randomization granularity to 1GB for both 4-level * and 5-level paging. */ static void __init init_trampoline(void) { #ifdef CONFIG_X86_64 /* * The code below will alias kernel page-tables in the user-range of the * address space, including the Global bit. So global TLB entries will * be created when using the trampoline page-table. */ if (!kaslr_memory_enabled()) trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)]; else init_trampoline_kaslr(); #endif } void __init init_mem_mapping(void) { unsigned long end; pti_check_boottime_disable(); probe_page_size_mask(); setup_pcid(); #ifdef CONFIG_X86_64 end = max_pfn << PAGE_SHIFT; #else end = max_low_pfn << PAGE_SHIFT; #endif /* the ISA range is always mapped regardless of memory holes */ init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL); /* Init the trampoline, possibly with KASLR memory offset */ init_trampoline(); /* * If the allocation is in bottom-up direction, we setup direct mapping * in bottom-up, otherwise we setup direct mapping in top-down. */ if (memblock_bottom_up()) { unsigned long kernel_end = __pa_symbol(_end); /* * we need two separate calls here. This is because we want to * allocate page tables above the kernel. So we first map * [kernel_end, end) to make memory above the kernel be mapped * as soon as possible. And then use page tables allocated above * the kernel to map [ISA_END_ADDRESS, kernel_end). */ memory_map_bottom_up(kernel_end, end); memory_map_bottom_up(ISA_END_ADDRESS, kernel_end); } else { memory_map_top_down(ISA_END_ADDRESS, end); } #ifdef CONFIG_X86_64 if (max_pfn > max_low_pfn) { /* can we preserve max_low_pfn ?*/ max_low_pfn = max_pfn; } #else early_ioremap_page_table_range_init(); #endif load_cr3(swapper_pg_dir); __flush_tlb_all(); x86_init.hyper.init_mem_mapping(); early_memtest(0, max_pfn_mapped << PAGE_SHIFT); } /* * Initialize an mm_struct to be used during poking and a pointer to be used * during patching. */ void __init poking_init(void) { spinlock_t *ptl; pte_t *ptep; text_poke_mm = mm_alloc(); BUG_ON(!text_poke_mm); /* Xen PV guests need the PGD to be pinned. */ paravirt_enter_mmap(text_poke_mm); set_notrack_mm(text_poke_mm); /* * Randomize the poking address, but make sure that the following page * will be mapped at the same PMD. We need 2 pages, so find space for 3, * and adjust the address if the PMD ends after the first one. */ text_poke_mm_addr = TASK_UNMAPPED_BASE; if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) text_poke_mm_addr += (kaslr_get_random_long("Poking") & PAGE_MASK) % (TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE); if (((text_poke_mm_addr + PAGE_SIZE) & ~PMD_MASK) == 0) text_poke_mm_addr += PAGE_SIZE; /* * We need to trigger the allocation of the page-tables that will be * needed for poking now. Later, poking may be performed in an atomic * section, which might cause allocation to fail. */ ptep = get_locked_pte(text_poke_mm, text_poke_mm_addr, &ptl); BUG_ON(!ptep); pte_unmap_unlock(ptep, ptl); } /* * devmem_is_allowed() checks to see if /dev/mem access to a certain address * is valid. The argument is a physical page number. * * On x86, access has to be given to the first megabyte of RAM because that * area traditionally contains BIOS code and data regions used by X, dosemu, * and similar apps. Since they map the entire memory range, the whole range * must be allowed (for mapping), but any areas that would otherwise be * disallowed are flagged as being "zero filled" instead of rejected. * Access has to be given to non-kernel-ram areas as well, these contain the * PCI mmio resources as well as potential bios/acpi data regions. */ int devmem_is_allowed(unsigned long pagenr) { if (region_intersects(PFN_PHYS(pagenr), PAGE_SIZE, IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) != REGION_DISJOINT) { /* * For disallowed memory regions in the low 1MB range, * request that the page be shown as all zeros. */ if (pagenr < 256) return 2; return 0; } /* * This must follow RAM test, since System RAM is considered a * restricted resource under CONFIG_STRICT_DEVMEM. */ if (iomem_is_exclusive(pagenr << PAGE_SHIFT)) { /* Low 1MB bypasses iomem restrictions. */ if (pagenr < 256) return 1; return 0; } return 1; } void free_init_pages(const char *what, unsigned long begin, unsigned long end) { unsigned long begin_aligned, end_aligned; /* Make sure boundaries are page aligned */ begin_aligned = PAGE_ALIGN(begin); end_aligned = end & PAGE_MASK; if (WARN_ON(begin_aligned != begin || end_aligned != end)) { begin = begin_aligned; end = end_aligned; } if (begin >= end) return; /* * If debugging page accesses then do not free this memory but * mark them not present - any buggy init-section access will * create a kernel page fault: */ if (debug_pagealloc_enabled()) { pr_info("debug: unmapping init [mem %#010lx-%#010lx]\n", begin, end - 1); /* * Inform kmemleak about the hole in the memory since the * corresponding pages will be unmapped. */ kmemleak_free_part((void *)begin, end - begin); set_memory_np(begin, (end - begin) >> PAGE_SHIFT); } else { /* * We just marked the kernel text read only above, now that * we are going to free part of that, we need to make that * writeable and non-executable first. */ set_memory_nx(begin, (end - begin) >> PAGE_SHIFT); set_memory_rw(begin, (end - begin) >> PAGE_SHIFT); free_reserved_area((void *)begin, (void *)end, POISON_FREE_INITMEM, what); } } /* * begin/end can be in the direct map or the "high kernel mapping" * used for the kernel image only. free_init_pages() will do the * right thing for either kind of address. */ void free_kernel_image_pages(const char *what, void *begin, void *end) { unsigned long begin_ul = (unsigned long)begin; unsigned long end_ul = (unsigned long)end; unsigned long len_pages = (end_ul - begin_ul) >> PAGE_SHIFT; free_init_pages(what, begin_ul, end_ul); /* * PTI maps some of the kernel into userspace. For performance, * this includes some kernel areas that do not contain secrets. * Those areas might be adjacent to the parts of the kernel image * being freed, which may contain secrets. Remove the "high kernel * image mapping" for these freed areas, ensuring they are not even * potentially vulnerable to Meltdown regardless of the specific * optimizations PTI is currently using. * * The "noalias" prevents unmapping the direct map alias which is * needed to access the freed pages. * * This is only valid for 64bit kernels. 32bit has only one mapping * which can't be treated in this way for obvious reasons. */ if (IS_ENABLED(CONFIG_X86_64) && cpu_feature_enabled(X86_FEATURE_PTI)) set_memory_np_noalias(begin_ul, len_pages); } void __ref free_initmem(void) { e820__reallocate_tables(); mem_encrypt_free_decrypted_mem(); free_kernel_image_pages("unused kernel image (initmem)", &__init_begin, &__init_end); } #ifdef CONFIG_BLK_DEV_INITRD void __init free_initrd_mem(unsigned long start, unsigned long end) { /* * end could be not aligned, and We can not align that, * decompressor could be confused by aligned initrd_end * We already reserve the end partial page before in * - i386_start_kernel() * - x86_64_start_kernel() * - relocate_initrd() * So here We can do PAGE_ALIGN() safely to get partial page to be freed */ free_init_pages("initrd", start, PAGE_ALIGN(end)); } #endif void __init zone_sizes_init(void) { unsigned long max_zone_pfns[MAX_NR_ZONES]; memset(max_zone_pfns, 0, sizeof(max_zone_pfns)); #ifdef CONFIG_ZONE_DMA max_zone_pfns[ZONE_DMA] = min(MAX_DMA_PFN, max_low_pfn); #endif #ifdef CONFIG_ZONE_DMA32 max_zone_pfns[ZONE_DMA32] = min(MAX_DMA32_PFN, max_low_pfn); #endif max_zone_pfns[ZONE_NORMAL] = max_low_pfn; #ifdef CONFIG_HIGHMEM max_zone_pfns[ZONE_HIGHMEM] = max_pfn; #endif free_area_init(max_zone_pfns); } __visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = { .loaded_mm = &init_mm, .next_asid = 1, .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */ }; #ifdef CONFIG_ADDRESS_MASKING DEFINE_PER_CPU(u64, tlbstate_untag_mask); EXPORT_PER_CPU_SYMBOL(tlbstate_untag_mask); #endif void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache) { /* entry 0 MUST be WB (hardwired to speed up translations) */ BUG_ON(!entry && cache != _PAGE_CACHE_MODE_WB); __cachemode2pte_tbl[cache] = __cm_idx2pte(entry); __pte2cachemode_tbl[entry] = cache; } #ifdef CONFIG_SWAP unsigned long arch_max_swapfile_size(void) { unsigned long pages; pages = generic_max_swapfile_size(); if (boot_cpu_has_bug(X86_BUG_L1TF) && l1tf_mitigation != L1TF_MITIGATION_OFF) { /* Limit the swap file size to MAX_PA/2 for L1TF workaround */ unsigned long long l1tf_limit = l1tf_pfn_limit(); /* * We encode swap offsets also with 3 bits below those for pfn * which makes the usable limit higher. */ #if CONFIG_PGTABLE_LEVELS > 2 l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT; #endif pages = min_t(unsigned long long, l1tf_limit, pages); } return pages; } #endif #ifdef CONFIG_EXECMEM static struct execmem_info execmem_info __ro_after_init; #ifdef CONFIG_ARCH_HAS_EXECMEM_ROX void execmem_fill_trapping_insns(void *ptr, size_t size) { memset(ptr, INT3_INSN_OPCODE, size); } #endif struct execmem_info __init *execmem_arch_setup(void) { unsigned long start, offset = 0; enum execmem_range_flags flags; pgprot_t pgprot; if (kaslr_enabled()) offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE; start = MODULES_VADDR + offset; if (IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX) && cpu_feature_enabled(X86_FEATURE_PSE)) { pgprot = PAGE_KERNEL_ROX; flags = EXECMEM_KASAN_SHADOW | EXECMEM_ROX_CACHE; } else { pgprot = PAGE_KERNEL; flags = EXECMEM_KASAN_SHADOW; } execmem_info = (struct execmem_info){ .ranges = { [EXECMEM_MODULE_TEXT] = { .flags = flags, .start = start, .end = MODULES_END, .pgprot = pgprot, .alignment = MODULE_ALIGN, }, [EXECMEM_KPROBES] = { .flags = flags, .start = start, .end = MODULES_END, .pgprot = PAGE_KERNEL_ROX, .alignment = MODULE_ALIGN, }, [EXECMEM_FTRACE] = { .flags = flags, .start = start, .end = MODULES_END, .pgprot = pgprot, .alignment = MODULE_ALIGN, }, [EXECMEM_BPF] = { .flags = EXECMEM_KASAN_SHADOW, .start = start, .end = MODULES_END, .pgprot = PAGE_KERNEL, .alignment = MODULE_ALIGN, }, [EXECMEM_MODULE_DATA] = { .flags = EXECMEM_KASAN_SHADOW, .start = start, .end = MODULES_END, .pgprot = PAGE_KERNEL, .alignment = MODULE_ALIGN, }, }, }; return &execmem_info; } #endif /* CONFIG_EXECMEM */ |
| 305 229 188 63 54 102 189 190 108 172 187 187 187 187 187 187 187 1 1 47 127 188 58 174 86 187 13 2297 305 63 39 15 156 32 188 188 2624 2627 102 2626 2599 186 305 47 39 266 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 1991-1998 Linus Torvalds * Re-organised Feb 1998 Russell King * Copyright (C) 2020 Christoph Hellwig */ #include <linux/fs.h> #include <linux/major.h> #include <linux/slab.h> #include <linux/ctype.h> #include <linux/vmalloc.h> #include <linux/raid/detect.h> #include "check.h" static int (*const check_part[])(struct parsed_partitions *) = { /* * Probe partition formats with tables at disk address 0 * that also have an ADFS boot block at 0xdc0. */ #ifdef CONFIG_ACORN_PARTITION_ICS adfspart_check_ICS, #endif #ifdef CONFIG_ACORN_PARTITION_POWERTEC adfspart_check_POWERTEC, #endif #ifdef CONFIG_ACORN_PARTITION_EESOX adfspart_check_EESOX, #endif /* * Now move on to formats that only have partition info at * disk address 0xdc0. Since these may also have stale * PC/BIOS partition tables, they need to come before * the msdos entry. */ #ifdef CONFIG_ACORN_PARTITION_CUMANA adfspart_check_CUMANA, #endif #ifdef CONFIG_ACORN_PARTITION_ADFS adfspart_check_ADFS, #endif #ifdef CONFIG_CMDLINE_PARTITION cmdline_partition, #endif #ifdef CONFIG_OF_PARTITION of_partition, /* cmdline have priority to OF */ #endif #ifdef CONFIG_EFI_PARTITION efi_partition, /* this must come before msdos */ #endif #ifdef CONFIG_SGI_PARTITION sgi_partition, #endif #ifdef CONFIG_LDM_PARTITION ldm_partition, /* this must come before msdos */ #endif #ifdef CONFIG_MSDOS_PARTITION msdos_partition, #endif #ifdef CONFIG_OSF_PARTITION osf_partition, #endif #ifdef CONFIG_SUN_PARTITION sun_partition, #endif #ifdef CONFIG_AMIGA_PARTITION amiga_partition, #endif #ifdef CONFIG_ATARI_PARTITION atari_partition, #endif #ifdef CONFIG_MAC_PARTITION mac_partition, #endif #ifdef CONFIG_ULTRIX_PARTITION ultrix_partition, #endif #ifdef CONFIG_IBM_PARTITION ibm_partition, #endif #ifdef CONFIG_KARMA_PARTITION karma_partition, #endif #ifdef CONFIG_SYSV68_PARTITION sysv68_partition, #endif NULL }; static struct parsed_partitions *allocate_partitions(struct gendisk *hd) { struct parsed_partitions *state; int nr = DISK_MAX_PARTS; state = kzalloc(sizeof(*state), GFP_KERNEL); if (!state) return NULL; state->parts = vzalloc(array_size(nr, sizeof(state->parts[0]))); if (!state->parts) { kfree(state); return NULL; } state->limit = nr; return state; } static void free_partitions(struct parsed_partitions *state) { vfree(state->parts); kfree(state); } static struct parsed_partitions *check_partition(struct gendisk *hd) { struct parsed_partitions *state; int i, res, err; state = allocate_partitions(hd); if (!state) return NULL; state->pp_buf = (char *)__get_free_page(GFP_KERNEL); if (!state->pp_buf) { free_partitions(state); return NULL; } state->pp_buf[0] = '\0'; state->disk = hd; snprintf(state->name, BDEVNAME_SIZE, "%s", hd->disk_name); snprintf(state->pp_buf, PAGE_SIZE, " %s:", state->name); if (isdigit(state->name[strlen(state->name)-1])) sprintf(state->name, "p"); i = res = err = 0; while (!res && check_part[i]) { memset(state->parts, 0, state->limit * sizeof(state->parts[0])); res = check_part[i++](state); if (res < 0) { /* * We have hit an I/O error which we don't report now. * But record it, and let the others do their job. */ err = res; res = 0; } } if (res > 0) { printk(KERN_INFO "%s", state->pp_buf); free_page((unsigned long)state->pp_buf); return state; } if (state->access_beyond_eod) err = -ENOSPC; /* * The partition is unrecognized. So report I/O errors if there were any */ if (err) res = err; if (res) { strlcat(state->pp_buf, " unable to read partition table\n", PAGE_SIZE); printk(KERN_INFO "%s", state->pp_buf); } free_page((unsigned long)state->pp_buf); free_partitions(state); return ERR_PTR(res); } static ssize_t part_partition_show(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%d\n", bdev_partno(dev_to_bdev(dev))); } static ssize_t part_start_show(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%llu\n", dev_to_bdev(dev)->bd_start_sect); } static ssize_t part_ro_show(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%d\n", bdev_read_only(dev_to_bdev(dev))); } static ssize_t part_alignment_offset_show(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%u\n", bdev_alignment_offset(dev_to_bdev(dev))); } static ssize_t part_discard_alignment_show(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev))); } static DEVICE_ATTR(partition, 0444, part_partition_show, NULL); static DEVICE_ATTR(start, 0444, part_start_show, NULL); static DEVICE_ATTR(size, 0444, part_size_show, NULL); static DEVICE_ATTR(ro, 0444, part_ro_show, NULL); static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL); static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL); static DEVICE_ATTR(stat, 0444, part_stat_show, NULL); static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, 0644, part_fail_show, part_fail_store); #endif static struct attribute *part_attrs[] = { &dev_attr_partition.attr, &dev_attr_start.attr, &dev_attr_size.attr, &dev_attr_ro.attr, &dev_attr_alignment_offset.attr, &dev_attr_discard_alignment.attr, &dev_attr_stat.attr, &dev_attr_inflight.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, #endif NULL }; static const struct attribute_group part_attr_group = { .attrs = part_attrs, }; static const struct attribute_group *part_attr_groups[] = { &part_attr_group, #ifdef CONFIG_BLK_DEV_IO_TRACE &blk_trace_attr_group, #endif NULL }; static void part_release(struct device *dev) { put_disk(dev_to_bdev(dev)->bd_disk); bdev_drop(dev_to_bdev(dev)); } static int part_uevent(const struct device *dev, struct kobj_uevent_env *env) { const struct block_device *part = dev_to_bdev(dev); add_uevent_var(env, "PARTN=%u", bdev_partno(part)); if (part->bd_meta_info && part->bd_meta_info->volname[0]) add_uevent_var(env, "PARTNAME=%s", part->bd_meta_info->volname); if (part->bd_meta_info && part->bd_meta_info->uuid[0]) add_uevent_var(env, "PARTUUID=%s", part->bd_meta_info->uuid); return 0; } const struct device_type part_type = { .name = "partition", .groups = part_attr_groups, .release = part_release, .uevent = part_uevent, }; void drop_partition(struct block_device *part) { lockdep_assert_held(&part->bd_disk->open_mutex); xa_erase(&part->bd_disk->part_tbl, bdev_partno(part)); kobject_put(part->bd_holder_dir); device_del(&part->bd_device); put_device(&part->bd_device); } static ssize_t whole_disk_show(struct device *dev, struct device_attribute *attr, char *buf) { return 0; } static const DEVICE_ATTR(whole_disk, 0444, whole_disk_show, NULL); /* * Must be called either with open_mutex held, before a disk can be opened or * after all disk users are gone. */ static struct block_device *add_partition(struct gendisk *disk, int partno, sector_t start, sector_t len, int flags, struct partition_meta_info *info) { dev_t devt = MKDEV(0, 0); struct device *ddev = disk_to_dev(disk); struct device *pdev; struct block_device *bdev; const char *dname; int err; lockdep_assert_held(&disk->open_mutex); if (partno >= DISK_MAX_PARTS) return ERR_PTR(-EINVAL); /* * Partitions are not supported on zoned block devices that are used as * such. */ if (bdev_is_zoned(disk->part0)) { pr_warn("%s: partitions not supported on host managed zoned block device\n", disk->disk_name); return ERR_PTR(-ENXIO); } if (xa_load(&disk->part_tbl, partno)) return ERR_PTR(-EBUSY); /* ensure we always have a reference to the whole disk */ get_device(disk_to_dev(disk)); err = -ENOMEM; bdev = bdev_alloc(disk, partno); if (!bdev) goto out_put_disk; bdev->bd_start_sect = start; bdev_set_nr_sectors(bdev, len); pdev = &bdev->bd_device; dname = dev_name(ddev); if (isdigit(dname[strlen(dname) - 1])) dev_set_name(pdev, "%sp%d", dname, partno); else dev_set_name(pdev, "%s%d", dname, partno); device_initialize(pdev); pdev->class = &block_class; pdev->type = &part_type; pdev->parent = ddev; /* in consecutive minor range? */ if (bdev_partno(bdev) < disk->minors) { devt = MKDEV(disk->major, disk->first_minor + bdev_partno(bdev)); } else { err = blk_alloc_ext_minor(); if (err < 0) goto out_put; devt = MKDEV(BLOCK_EXT_MAJOR, err); } pdev->devt = devt; if (info) { err = -ENOMEM; bdev->bd_meta_info = kmemdup(info, sizeof(*info), GFP_KERNEL); if (!bdev->bd_meta_info) goto out_put; } /* delay uevent until 'holders' subdir is created */ dev_set_uevent_suppress(pdev, 1); err = device_add(pdev); if (err) goto out_put; err = -ENOMEM; bdev->bd_holder_dir = kobject_create_and_add("holders", &pdev->kobj); if (!bdev->bd_holder_dir) goto out_del; dev_set_uevent_suppress(pdev, 0); if (flags & ADDPART_FLAG_WHOLEDISK) { err = device_create_file(pdev, &dev_attr_whole_disk); if (err) goto out_del; } if (flags & ADDPART_FLAG_READONLY) bdev_set_flag(bdev, BD_READ_ONLY); /* everything is up and running, commence */ err = xa_insert(&disk->part_tbl, partno, bdev, GFP_KERNEL); if (err) goto out_del; bdev_add(bdev, devt); /* suppress uevent if the disk suppresses it */ if (!dev_get_uevent_suppress(ddev)) kobject_uevent(&pdev->kobj, KOBJ_ADD); return bdev; out_del: kobject_put(bdev->bd_holder_dir); device_del(pdev); out_put: put_device(pdev); return ERR_PTR(err); out_put_disk: put_disk(disk); return ERR_PTR(err); } static bool partition_overlaps(struct gendisk *disk, sector_t start, sector_t length, int skip_partno) { struct block_device *part; bool overlap = false; unsigned long idx; rcu_read_lock(); xa_for_each_start(&disk->part_tbl, idx, part, 1) { if (bdev_partno(part) != skip_partno && start < part->bd_start_sect + bdev_nr_sectors(part) && start + length > part->bd_start_sect) { overlap = true; break; } } rcu_read_unlock(); return overlap; } int bdev_add_partition(struct gendisk *disk, int partno, sector_t start, sector_t length) { struct block_device *part; int ret; mutex_lock(&disk->open_mutex); if (!disk_live(disk)) { ret = -ENXIO; goto out; } if (disk->flags & GENHD_FL_NO_PART) { ret = -EINVAL; goto out; } if (partition_overlaps(disk, start, length, -1)) { ret = -EBUSY; goto out; } part = add_partition(disk, partno, start, length, ADDPART_FLAG_NONE, NULL); ret = PTR_ERR_OR_ZERO(part); out: mutex_unlock(&disk->open_mutex); return ret; } int bdev_del_partition(struct gendisk *disk, int partno) { struct block_device *part = NULL; int ret = -ENXIO; mutex_lock(&disk->open_mutex); part = xa_load(&disk->part_tbl, partno); if (!part) goto out_unlock; ret = -EBUSY; if (atomic_read(&part->bd_openers)) goto out_unlock; /* * We verified that @part->bd_openers is zero above and so * @part->bd_holder{_ops} can't be set. And since we hold * @disk->open_mutex the device can't be claimed by anyone. * * So no need to call @part->bd_holder_ops->mark_dead() here. * Just delete the partition and invalidate it. */ bdev_unhash(part); invalidate_bdev(part); drop_partition(part); ret = 0; out_unlock: mutex_unlock(&disk->open_mutex); return ret; } int bdev_resize_partition(struct gendisk *disk, int partno, sector_t start, sector_t length) { struct block_device *part = NULL; int ret = -ENXIO; mutex_lock(&disk->open_mutex); part = xa_load(&disk->part_tbl, partno); if (!part) goto out_unlock; ret = -EINVAL; if (start != part->bd_start_sect) goto out_unlock; ret = -EBUSY; if (partition_overlaps(disk, start, length, partno)) goto out_unlock; bdev_set_nr_sectors(part, length); ret = 0; out_unlock: mutex_unlock(&disk->open_mutex); return ret; } static bool disk_unlock_native_capacity(struct gendisk *disk) { if (!disk->fops->unlock_native_capacity || test_and_set_bit(GD_NATIVE_CAPACITY, &disk->state)) { printk(KERN_CONT "truncated\n"); return false; } printk(KERN_CONT "enabling native capacity\n"); disk->fops->unlock_native_capacity(disk); return true; } static bool blk_add_partition(struct gendisk *disk, struct parsed_partitions *state, int p) { sector_t size = state->parts[p].size; sector_t from = state->parts[p].from; struct block_device *part; if (!size) return true; if (from >= get_capacity(disk)) { printk(KERN_WARNING "%s: p%d start %llu is beyond EOD, ", disk->disk_name, p, (unsigned long long) from); if (disk_unlock_native_capacity(disk)) return false; return true; } if (from + size > get_capacity(disk)) { printk(KERN_WARNING "%s: p%d size %llu extends beyond EOD, ", disk->disk_name, p, (unsigned long long) size); if (disk_unlock_native_capacity(disk)) return false; /* * We can not ignore partitions of broken tables created by for * example camera firmware, but we limit them to the end of the * disk to avoid creating invalid block devices. */ size = get_capacity(disk) - from; } part = add_partition(disk, p, from, size, state->parts[p].flags, &state->parts[p].info); if (IS_ERR(part)) { if (PTR_ERR(part) != -ENXIO) { printk(KERN_ERR " %s: p%d could not be added: %pe\n", disk->disk_name, p, part); } return true; } if (IS_BUILTIN(CONFIG_BLK_DEV_MD) && (state->parts[p].flags & ADDPART_FLAG_RAID)) md_autodetect_dev(part->bd_dev); return true; } static int blk_add_partitions(struct gendisk *disk) { struct parsed_partitions *state; int ret = -EAGAIN, p; if (!disk_has_partscan(disk)) return 0; state = check_partition(disk); if (!state) return 0; if (IS_ERR(state)) { /* * I/O error reading the partition table. If we tried to read * beyond EOD, retry after unlocking the native capacity. */ if (PTR_ERR(state) == -ENOSPC) { printk(KERN_WARNING "%s: partition table beyond EOD, ", disk->disk_name); if (disk_unlock_native_capacity(disk)) return -EAGAIN; } return -EIO; } /* * Partitions are not supported on host managed zoned block devices. */ if (bdev_is_zoned(disk->part0)) { pr_warn("%s: ignoring partition table on host managed zoned block device\n", disk->disk_name); ret = 0; goto out_free_state; } /* * If we read beyond EOD, try unlocking native capacity even if the * partition table was successfully read as we could be missing some * partitions. */ if (state->access_beyond_eod) { printk(KERN_WARNING "%s: partition table partially beyond EOD, ", disk->disk_name); if (disk_unlock_native_capacity(disk)) goto out_free_state; } /* tell userspace that the media / partition table may have changed */ kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE); for (p = 1; p < state->limit; p++) if (!blk_add_partition(disk, state, p)) goto out_free_state; ret = 0; out_free_state: free_partitions(state); return ret; } int bdev_disk_changed(struct gendisk *disk, bool invalidate) { struct block_device *part; unsigned long idx; int ret = 0; lockdep_assert_held(&disk->open_mutex); if (!disk_live(disk)) return -ENXIO; rescan: if (disk->open_partitions) return -EBUSY; sync_blockdev(disk->part0); invalidate_bdev(disk->part0); xa_for_each_start(&disk->part_tbl, idx, part, 1) { /* * Remove the block device from the inode hash, so that * it cannot be looked up any more even when openers * still hold references. */ bdev_unhash(part); /* * If @disk->open_partitions isn't elevated but there's * still an active holder of that block device things * are broken. */ WARN_ON_ONCE(atomic_read(&part->bd_openers)); invalidate_bdev(part); drop_partition(part); } clear_bit(GD_NEED_PART_SCAN, &disk->state); /* * Historically we only set the capacity to zero for devices that * support partitions (independ of actually having partitions created). * Doing that is rather inconsistent, but changing it broke legacy * udisks polling for legacy ide-cdrom devices. Use the crude check * below to get the sane behavior for most device while not breaking * userspace for this particular setup. */ if (invalidate) { if (!(disk->flags & GENHD_FL_NO_PART) || !(disk->flags & GENHD_FL_REMOVABLE)) set_capacity(disk, 0); } if (get_capacity(disk)) { ret = blk_add_partitions(disk); if (ret == -EAGAIN) goto rescan; } else if (invalidate) { /* * Tell userspace that the media / partition table may have * changed. */ kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE); } return ret; } /* * Only exported for loop and dasd for historic reasons. Don't use in new * code! */ EXPORT_SYMBOL_GPL(bdev_disk_changed); void *read_part_sector(struct parsed_partitions *state, sector_t n, Sector *p) { struct address_space *mapping = state->disk->part0->bd_mapping; struct folio *folio; if (n >= get_capacity(state->disk)) { state->access_beyond_eod = true; goto out; } folio = read_mapping_folio(mapping, n >> PAGE_SECTORS_SHIFT, NULL); if (IS_ERR(folio)) goto out; p->v = folio; return folio_address(folio) + offset_in_folio(folio, n * SECTOR_SIZE); out: p->v = NULL; return NULL; } |
| 876 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Block data types and constants. Directly include this file only to * break include dependency loop. */ #ifndef __LINUX_BLK_TYPES_H #define __LINUX_BLK_TYPES_H #include <linux/types.h> #include <linux/bvec.h> #include <linux/device.h> #include <linux/ktime.h> #include <linux/rw_hint.h> struct bio_set; struct bio; struct bio_integrity_payload; struct page; struct io_context; struct cgroup_subsys_state; typedef void (bio_end_io_t) (struct bio *); struct bio_crypt_ctx; /* * The basic unit of block I/O is a sector. It is used in a number of contexts * in Linux (blk, bio, genhd). The size of one sector is 512 = 2**9 * bytes. Variables of type sector_t represent an offset or size that is a * multiple of 512 bytes. Hence these two constants. */ #ifndef SECTOR_SHIFT #define SECTOR_SHIFT 9 #endif #ifndef SECTOR_SIZE #define SECTOR_SIZE (1 << SECTOR_SHIFT) #endif #define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) #define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT) #define SECTOR_MASK (PAGE_SECTORS - 1) struct block_device { sector_t bd_start_sect; sector_t bd_nr_sectors; struct gendisk * bd_disk; struct request_queue * bd_queue; struct disk_stats __percpu *bd_stats; unsigned long bd_stamp; atomic_t __bd_flags; // partition number + flags #define BD_PARTNO 255 // lower 8 bits; assign-once #define BD_READ_ONLY (1u<<8) // read-only policy #define BD_WRITE_HOLDER (1u<<9) #define BD_HAS_SUBMIT_BIO (1u<<10) #define BD_RO_WARNED (1u<<11) #ifdef CONFIG_FAIL_MAKE_REQUEST #define BD_MAKE_IT_FAIL (1u<<12) #endif dev_t bd_dev; struct address_space *bd_mapping; /* page cache */ atomic_t bd_openers; spinlock_t bd_size_lock; /* for bd_inode->i_size updates */ void * bd_claiming; void * bd_holder; const struct blk_holder_ops *bd_holder_ops; struct mutex bd_holder_lock; int bd_holders; struct kobject *bd_holder_dir; atomic_t bd_fsfreeze_count; /* number of freeze requests */ struct mutex bd_fsfreeze_mutex; /* serialize freeze/thaw */ struct partition_meta_info *bd_meta_info; int bd_writers; #ifdef CONFIG_SECURITY void *bd_security; #endif /* * keep this out-of-line as it's both big and not needed in the fast * path */ struct device bd_device; } __randomize_layout; #define bdev_whole(_bdev) \ ((_bdev)->bd_disk->part0) #define dev_to_bdev(device) \ container_of((device), struct block_device, bd_device) #define bdev_kobj(_bdev) \ (&((_bdev)->bd_device.kobj)) /* * Block error status values. See block/blk-core:blk_errors for the details. */ typedef u8 __bitwise blk_status_t; typedef u16 blk_short_t; #define BLK_STS_OK 0 #define BLK_STS_NOTSUPP ((__force blk_status_t)1) #define BLK_STS_TIMEOUT ((__force blk_status_t)2) #define BLK_STS_NOSPC ((__force blk_status_t)3) #define BLK_STS_TRANSPORT ((__force blk_status_t)4) #define BLK_STS_TARGET ((__force blk_status_t)5) #define BLK_STS_RESV_CONFLICT ((__force blk_status_t)6) #define BLK_STS_MEDIUM ((__force blk_status_t)7) #define BLK_STS_PROTECTION ((__force blk_status_t)8) #define BLK_STS_RESOURCE ((__force blk_status_t)9) #define BLK_STS_IOERR ((__force blk_status_t)10) /* hack for device mapper, don't use elsewhere: */ #define BLK_STS_DM_REQUEUE ((__force blk_status_t)11) /* * BLK_STS_AGAIN should only be returned if RQF_NOWAIT is set * and the bio would block (cf bio_wouldblock_error()) */ #define BLK_STS_AGAIN ((__force blk_status_t)12) /* * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if * device related resources are unavailable, but the driver can guarantee * that the queue will be rerun in the future once resources become * available again. This is typically the case for device specific * resources that are consumed for IO. If the driver fails allocating these * resources, we know that inflight (or pending) IO will free these * resource upon completion. * * This is different from BLK_STS_RESOURCE in that it explicitly references * a device specific resource. For resources of wider scope, allocation * failure can happen without having pending IO. This means that we can't * rely on request completions freeing these resources, as IO may not be in * flight. Examples of that are kernel memory allocations, DMA mappings, or * any other system wide resources. */ #define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13) /* * BLK_STS_ZONE_OPEN_RESOURCE is returned from the driver in the completion * path if the device returns a status indicating that too many zone resources * are currently open. The same command should be successful if resubmitted * after the number of open zones decreases below the device's limits, which is * reported in the request_queue's max_open_zones. */ #define BLK_STS_ZONE_OPEN_RESOURCE ((__force blk_status_t)14) /* * BLK_STS_ZONE_ACTIVE_RESOURCE is returned from the driver in the completion * path if the device returns a status indicating that too many zone resources * are currently active. The same command should be successful if resubmitted * after the number of active zones decreases below the device's limits, which * is reported in the request_queue's max_active_zones. */ #define BLK_STS_ZONE_ACTIVE_RESOURCE ((__force blk_status_t)15) /* * BLK_STS_OFFLINE is returned from the driver when the target device is offline * or is being taken offline. This could help differentiate the case where a * device is intentionally being shut down from a real I/O error. */ #define BLK_STS_OFFLINE ((__force blk_status_t)16) /* * BLK_STS_DURATION_LIMIT is returned from the driver when the target device * aborted the command because it exceeded one of its Command Duration Limits. */ #define BLK_STS_DURATION_LIMIT ((__force blk_status_t)17) /* * Invalid size or alignment. */ #define BLK_STS_INVAL ((__force blk_status_t)19) /** * blk_path_error - returns true if error may be path related * @error: status the request was completed with * * Description: * This classifies block error status into non-retryable errors and ones * that may be successful if retried on a failover path. * * Return: * %false - retrying failover path will not help * %true - may succeed if retried */ static inline bool blk_path_error(blk_status_t error) { switch (error) { case BLK_STS_NOTSUPP: case BLK_STS_NOSPC: case BLK_STS_TARGET: case BLK_STS_RESV_CONFLICT: case BLK_STS_MEDIUM: case BLK_STS_PROTECTION: return false; } /* Anything else could be a path failure, so should be retried */ return true; } struct bio_issue { u64 value; }; typedef __u32 __bitwise blk_opf_t; typedef unsigned int blk_qc_t; #define BLK_QC_T_NONE -1U /* * main unit of I/O for the block layer and lower layers (ie drivers and * stacking drivers) */ struct bio { struct bio *bi_next; /* request queue link */ struct block_device *bi_bdev; blk_opf_t bi_opf; /* bottom bits REQ_OP, top bits * req_flags. */ unsigned short bi_flags; /* BIO_* below */ unsigned short bi_ioprio; enum rw_hint bi_write_hint; u8 bi_write_stream; blk_status_t bi_status; atomic_t __bi_remaining; struct bvec_iter bi_iter; union { /* for polled bios: */ blk_qc_t bi_cookie; /* for plugged zoned writes only: */ unsigned int __bi_nr_segments; }; bio_end_io_t *bi_end_io; void *bi_private; #ifdef CONFIG_BLK_CGROUP /* * Represents the association of the css and request_queue for the bio. * If a bio goes direct to device, it will not have a blkg as it will * not have a request_queue associated with it. The reference is put * on release of the bio. */ struct blkcg_gq *bi_blkg; struct bio_issue bi_issue; #ifdef CONFIG_BLK_CGROUP_IOCOST u64 bi_iocost_cost; #endif #endif #ifdef CONFIG_BLK_INLINE_ENCRYPTION struct bio_crypt_ctx *bi_crypt_context; #endif #if defined(CONFIG_BLK_DEV_INTEGRITY) struct bio_integrity_payload *bi_integrity; /* data integrity */ #endif unsigned short bi_vcnt; /* how many bio_vec's */ /* * Everything starting with bi_max_vecs will be preserved by bio_reset() */ unsigned short bi_max_vecs; /* max bvl_vecs we can hold */ atomic_t __bi_cnt; /* pin count */ struct bio_vec *bi_io_vec; /* the actual vec list */ struct bio_set *bi_pool; /* * We can inline a number of vecs at the end of the bio, to avoid * double allocations for a small number of bio_vecs. This member * MUST obviously be kept at the very end of the bio. */ struct bio_vec bi_inline_vecs[]; }; #define BIO_RESET_BYTES offsetof(struct bio, bi_max_vecs) #define BIO_MAX_SECTORS (UINT_MAX >> SECTOR_SHIFT) /* * bio flags */ enum { BIO_PAGE_PINNED, /* Unpin pages in bio_release_pages() */ BIO_CLONED, /* doesn't own data */ BIO_QUIET, /* Make BIO Quiet */ BIO_CHAIN, /* chained bio, ->bi_remaining in effect */ BIO_REFFED, /* bio has elevated ->bi_cnt */ BIO_BPS_THROTTLED, /* This bio has already been subjected to * throttling rules. Don't do it again. */ BIO_TRACE_COMPLETION, /* bio_endio() should trace the final completion * of this bio. */ BIO_CGROUP_ACCT, /* has been accounted to a cgroup */ BIO_QOS_THROTTLED, /* bio went through rq_qos throttle path */ /* * This bio has completed bps throttling at the single tg granularity, * which is different from BIO_BPS_THROTTLED. When the bio is enqueued * into the sq->queued of the upper tg, or is about to be dispatched, * this flag needs to be cleared. Since blk-throttle and rq_qos are not * on the same hierarchical level, reuse the value. */ BIO_TG_BPS_THROTTLED = BIO_QOS_THROTTLED, BIO_QOS_MERGED, /* but went through rq_qos merge path */ BIO_REMAPPED, BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */ BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */ BIO_FLAG_LAST }; typedef __u32 __bitwise blk_mq_req_flags_t; #define REQ_OP_BITS 8 #define REQ_OP_MASK (__force blk_opf_t)((1 << REQ_OP_BITS) - 1) #define REQ_FLAG_BITS 24 /** * enum req_op - Operations common to the bio and request structures. * We use 8 bits for encoding the operation, and the remaining 24 for flags. * * The least significant bit of the operation number indicates the data * transfer direction: * * - if the least significant bit is set transfers are TO the device * - if the least significant bit is not set transfers are FROM the device * * If a operation does not transfer data the least significant bit has no * meaning. */ enum req_op { /* read sectors from the device */ REQ_OP_READ = (__force blk_opf_t)0, /* write sectors to the device */ REQ_OP_WRITE = (__force blk_opf_t)1, /* flush the volatile write cache */ REQ_OP_FLUSH = (__force blk_opf_t)2, /* discard sectors */ REQ_OP_DISCARD = (__force blk_opf_t)3, /* securely erase sectors */ REQ_OP_SECURE_ERASE = (__force blk_opf_t)5, /* write data at the current zone write pointer */ REQ_OP_ZONE_APPEND = (__force blk_opf_t)7, /* write the zero filled sector many times */ REQ_OP_WRITE_ZEROES = (__force blk_opf_t)9, /* Open a zone */ REQ_OP_ZONE_OPEN = (__force blk_opf_t)10, /* Close a zone */ REQ_OP_ZONE_CLOSE = (__force blk_opf_t)11, /* Transition a zone to full */ REQ_OP_ZONE_FINISH = (__force blk_opf_t)13, /* reset a zone write pointer */ REQ_OP_ZONE_RESET = (__force blk_opf_t)15, /* reset all the zone present on the device */ REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)17, /* Driver private requests */ REQ_OP_DRV_IN = (__force blk_opf_t)34, REQ_OP_DRV_OUT = (__force blk_opf_t)35, REQ_OP_LAST = (__force blk_opf_t)36, }; /* Keep cmd_flag_name[] in sync with the definitions below */ enum req_flag_bits { __REQ_FAILFAST_DEV = /* no driver retries of device errors */ REQ_OP_BITS, __REQ_FAILFAST_TRANSPORT, /* no driver retries of transport errors */ __REQ_FAILFAST_DRIVER, /* no driver retries of driver errors */ __REQ_SYNC, /* request is sync (sync write or read) */ __REQ_META, /* metadata io request */ __REQ_PRIO, /* boost priority in cfq */ __REQ_NOMERGE, /* don't touch this for merging */ __REQ_IDLE, /* anticipate more IO after this one */ __REQ_INTEGRITY, /* I/O includes block integrity payload */ __REQ_FUA, /* forced unit access */ __REQ_PREFLUSH, /* request for cache flush */ __REQ_RAHEAD, /* read ahead, can fail anytime */ __REQ_BACKGROUND, /* background IO */ __REQ_NOWAIT, /* Don't wait if request will block */ __REQ_POLLED, /* caller polls for completion using bio_poll */ __REQ_ALLOC_CACHE, /* allocate IO from cache if available */ __REQ_SWAP, /* swap I/O */ __REQ_DRV, /* for driver use */ __REQ_FS_PRIVATE, /* for file system (submitter) use */ __REQ_ATOMIC, /* for atomic write operations */ __REQ_P2PDMA, /* contains P2P DMA pages */ /* * Command specific flags, keep last: */ /* for REQ_OP_WRITE_ZEROES: */ __REQ_NOUNMAP, /* do not free blocks when zeroing */ __REQ_NR_BITS, /* stops here */ }; #define REQ_FAILFAST_DEV \ (__force blk_opf_t)(1ULL << __REQ_FAILFAST_DEV) #define REQ_FAILFAST_TRANSPORT \ (__force blk_opf_t)(1ULL << __REQ_FAILFAST_TRANSPORT) #define REQ_FAILFAST_DRIVER \ (__force blk_opf_t)(1ULL << __REQ_FAILFAST_DRIVER) #define REQ_SYNC (__force blk_opf_t)(1ULL << __REQ_SYNC) #define REQ_META (__force blk_opf_t)(1ULL << __REQ_META) #define REQ_PRIO (__force blk_opf_t)(1ULL << __REQ_PRIO) #define REQ_NOMERGE (__force blk_opf_t)(1ULL << __REQ_NOMERGE) #define REQ_IDLE (__force blk_opf_t)(1ULL << __REQ_IDLE) #define REQ_INTEGRITY (__force blk_opf_t)(1ULL << __REQ_INTEGRITY) #define REQ_FUA (__force blk_opf_t)(1ULL << __REQ_FUA) #define REQ_PREFLUSH (__force blk_opf_t)(1ULL << __REQ_PREFLUSH) #define REQ_RAHEAD (__force blk_opf_t)(1ULL << __REQ_RAHEAD) #define REQ_BACKGROUND (__force blk_opf_t)(1ULL << __REQ_BACKGROUND) #define REQ_NOWAIT (__force blk_opf_t)(1ULL << __REQ_NOWAIT) #define REQ_POLLED (__force blk_opf_t)(1ULL << __REQ_POLLED) #define REQ_ALLOC_CACHE (__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE) #define REQ_SWAP (__force blk_opf_t)(1ULL << __REQ_SWAP) #define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV) #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE) #define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC) #define REQ_P2PDMA (__force blk_opf_t)(1ULL << __REQ_P2PDMA) #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP) #define REQ_FAILFAST_MASK \ (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) #define REQ_NOMERGE_FLAGS \ (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA) enum stat_group { STAT_READ, STAT_WRITE, STAT_DISCARD, STAT_FLUSH, NR_STAT_GROUPS }; static inline enum req_op bio_op(const struct bio *bio) { return bio->bi_opf & REQ_OP_MASK; } static inline bool op_is_write(blk_opf_t op) { return !!(op & (__force blk_opf_t)1); } /* * Check if the bio or request is one that needs special treatment in the * flush state machine. */ static inline bool op_is_flush(blk_opf_t op) { return op & (REQ_FUA | REQ_PREFLUSH); } /* * Reads are always treated as synchronous, as are requests with the FUA or * PREFLUSH flag. Other operations may be marked as synchronous using the * REQ_SYNC flag. */ static inline bool op_is_sync(blk_opf_t op) { return (op & REQ_OP_MASK) == REQ_OP_READ || (op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH)); } static inline bool op_is_discard(blk_opf_t op) { return (op & REQ_OP_MASK) == REQ_OP_DISCARD; } /* * Check if a bio or request operation is a zone management operation, with * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case * due to its different handling in the block layer and device response in * case of command failure. */ static inline bool op_is_zone_mgmt(enum req_op op) { switch (op & REQ_OP_MASK) { case REQ_OP_ZONE_RESET: case REQ_OP_ZONE_OPEN: case REQ_OP_ZONE_CLOSE: case REQ_OP_ZONE_FINISH: return true; default: return false; } } static inline int op_stat_group(enum req_op op) { if (op_is_discard(op)) return STAT_DISCARD; return op_is_write(op); } struct blk_rq_stat { u64 mean; u64 min; u64 max; u32 nr_samples; u64 batch; }; #endif /* __LINUX_BLK_TYPES_H */ |
| 4 28 28 28 3 3 24 24 24 85 87 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 | // SPDX-License-Identifier: GPL-2.0-only /* Helper handling for netfilter. */ /* (C) 1999-2001 Paul `Rusty' Russell * (C) 2002-2006 Netfilter Core Team <coreteam@netfilter.org> * (C) 2003,2004 USAGI/WIDE Project <http://www.linux-ipv6.org> * (C) 2006-2012 Patrick McHardy <kaber@trash.net> */ #include <linux/types.h> #include <linux/netfilter.h> #include <linux/module.h> #include <linux/skbuff.h> #include <linux/vmalloc.h> #include <linux/stddef.h> #include <linux/random.h> #include <linux/err.h> #include <linux/kernel.h> #include <linux/netdevice.h> #include <linux/rculist.h> #include <linux/rtnetlink.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_core.h> #include <net/netfilter/nf_conntrack_ecache.h> #include <net/netfilter/nf_conntrack_extend.h> #include <net/netfilter/nf_conntrack_helper.h> #include <net/netfilter/nf_conntrack_l4proto.h> #include <net/netfilter/nf_conntrack_seqadj.h> #include <net/netfilter/nf_log.h> #include <net/ip.h> static DEFINE_MUTEX(nf_ct_helper_mutex); struct hlist_head *nf_ct_helper_hash __read_mostly; EXPORT_SYMBOL_GPL(nf_ct_helper_hash); unsigned int nf_ct_helper_hsize __read_mostly; EXPORT_SYMBOL_GPL(nf_ct_helper_hsize); static unsigned int nf_ct_helper_count __read_mostly; static DEFINE_MUTEX(nf_ct_nat_helpers_mutex); static struct list_head nf_ct_nat_helpers __read_mostly; /* Stupid hash, but collision free for the default registrations of the * helpers currently in the kernel. */ static unsigned int helper_hash(const struct nf_conntrack_tuple *tuple) { return (((tuple->src.l3num << 8) | tuple->dst.protonum) ^ (__force __u16)tuple->src.u.all) % nf_ct_helper_hsize; } struct nf_conntrack_helper * __nf_conntrack_helper_find(const char *name, u16 l3num, u8 protonum) { struct nf_conntrack_helper *h; unsigned int i; for (i = 0; i < nf_ct_helper_hsize; i++) { hlist_for_each_entry_rcu(h, &nf_ct_helper_hash[i], hnode) { if (strcmp(h->name, name)) continue; if (h->tuple.src.l3num != NFPROTO_UNSPEC && h->tuple.src.l3num != l3num) continue; if (h->tuple.dst.protonum == protonum) return h; } } return NULL; } EXPORT_SYMBOL_GPL(__nf_conntrack_helper_find); struct nf_conntrack_helper * nf_conntrack_helper_try_module_get(const char *name, u16 l3num, u8 protonum) { struct nf_conntrack_helper *h; rcu_read_lock(); h = __nf_conntrack_helper_find(name, l3num, protonum); #ifdef CONFIG_MODULES if (h == NULL) { rcu_read_unlock(); if (request_module("nfct-helper-%s", name) == 0) { rcu_read_lock(); h = __nf_conntrack_helper_find(name, l3num, protonum); } else { return h; } } #endif if (h != NULL && !try_module_get(h->me)) h = NULL; if (h != NULL && !refcount_inc_not_zero(&h->refcnt)) { module_put(h->me); h = NULL; } rcu_read_unlock(); return h; } EXPORT_SYMBOL_GPL(nf_conntrack_helper_try_module_get); void nf_conntrack_helper_put(struct nf_conntrack_helper *helper) { refcount_dec(&helper->refcnt); module_put(helper->me); } EXPORT_SYMBOL_GPL(nf_conntrack_helper_put); static struct nf_conntrack_nat_helper * nf_conntrack_nat_helper_find(const char *mod_name) { struct nf_conntrack_nat_helper *cur; bool found = false; list_for_each_entry_rcu(cur, &nf_ct_nat_helpers, list) { if (!strcmp(cur->mod_name, mod_name)) { found = true; break; } } return found ? cur : NULL; } int nf_nat_helper_try_module_get(const char *name, u16 l3num, u8 protonum) { struct nf_conntrack_helper *h; struct nf_conntrack_nat_helper *nat; char mod_name[NF_CT_HELPER_NAME_LEN]; int ret = 0; rcu_read_lock(); h = __nf_conntrack_helper_find(name, l3num, protonum); if (!h) { rcu_read_unlock(); return -ENOENT; } nat = nf_conntrack_nat_helper_find(h->nat_mod_name); if (!nat) { snprintf(mod_name, sizeof(mod_name), "%s", h->nat_mod_name); rcu_read_unlock(); request_module("%s", mod_name); rcu_read_lock(); nat = nf_conntrack_nat_helper_find(mod_name); if (!nat) { rcu_read_unlock(); return -ENOENT; } } if (!try_module_get(nat->module)) ret = -ENOENT; rcu_read_unlock(); return ret; } EXPORT_SYMBOL_GPL(nf_nat_helper_try_module_get); void nf_nat_helper_put(struct nf_conntrack_helper *helper) { struct nf_conntrack_nat_helper *nat; nat = nf_conntrack_nat_helper_find(helper->nat_mod_name); if (WARN_ON_ONCE(!nat)) return; module_put(nat->module); } EXPORT_SYMBOL_GPL(nf_nat_helper_put); struct nf_conn_help * nf_ct_helper_ext_add(struct nf_conn *ct, gfp_t gfp) { struct nf_conn_help *help; help = nf_ct_ext_add(ct, NF_CT_EXT_HELPER, gfp); if (help) INIT_HLIST_HEAD(&help->expectations); else pr_debug("failed to add helper extension area"); return help; } EXPORT_SYMBOL_GPL(nf_ct_helper_ext_add); int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl, gfp_t flags) { struct nf_conntrack_helper *helper = NULL; struct nf_conn_help *help; /* We already got a helper explicitly attached (e.g. nft_ct) */ if (test_bit(IPS_HELPER_BIT, &ct->status)) return 0; if (WARN_ON_ONCE(!tmpl)) return 0; help = nfct_help(tmpl); if (help != NULL) { helper = rcu_dereference(help->helper); set_bit(IPS_HELPER_BIT, &ct->status); } help = nfct_help(ct); if (helper == NULL) { if (help) RCU_INIT_POINTER(help->helper, NULL); return 0; } if (help == NULL) { help = nf_ct_helper_ext_add(ct, flags); if (help == NULL) return -ENOMEM; } else { /* We only allow helper re-assignment of the same sort since * we cannot reallocate the helper extension area. */ struct nf_conntrack_helper *tmp = rcu_dereference(help->helper); if (tmp && tmp->help != helper->help) { RCU_INIT_POINTER(help->helper, NULL); return 0; } } rcu_assign_pointer(help->helper, helper); return 0; } EXPORT_SYMBOL_GPL(__nf_ct_try_assign_helper); /* appropriate ct lock protecting must be taken by caller */ static int unhelp(struct nf_conn *ct, void *me) { struct nf_conn_help *help = nfct_help(ct); if (help && rcu_dereference_raw(help->helper) == me) { nf_conntrack_event(IPCT_HELPER, ct); RCU_INIT_POINTER(help->helper, NULL); } /* We are not intended to delete this conntrack. */ return 0; } void nf_ct_helper_destroy(struct nf_conn *ct) { struct nf_conn_help *help = nfct_help(ct); struct nf_conntrack_helper *helper; if (help) { rcu_read_lock(); helper = rcu_dereference(help->helper); if (helper && helper->destroy) helper->destroy(ct); rcu_read_unlock(); } } static LIST_HEAD(nf_ct_helper_expectfn_list); void nf_ct_helper_expectfn_register(struct nf_ct_helper_expectfn *n) { spin_lock_bh(&nf_conntrack_expect_lock); list_add_rcu(&n->head, &nf_ct_helper_expectfn_list); spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_register); void nf_ct_helper_expectfn_unregister(struct nf_ct_helper_expectfn *n) { spin_lock_bh(&nf_conntrack_expect_lock); list_del_rcu(&n->head); spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_unregister); /* Caller should hold the rcu lock */ struct nf_ct_helper_expectfn * nf_ct_helper_expectfn_find_by_name(const char *name) { struct nf_ct_helper_expectfn *cur; bool found = false; list_for_each_entry_rcu(cur, &nf_ct_helper_expectfn_list, head) { if (!strcmp(cur->name, name)) { found = true; break; } } return found ? cur : NULL; } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_find_by_name); /* Caller should hold the rcu lock */ struct nf_ct_helper_expectfn * nf_ct_helper_expectfn_find_by_symbol(const void *symbol) { struct nf_ct_helper_expectfn *cur; bool found = false; list_for_each_entry_rcu(cur, &nf_ct_helper_expectfn_list, head) { if (cur->expectfn == symbol) { found = true; break; } } return found ? cur : NULL; } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_find_by_symbol); __printf(3, 4) void nf_ct_helper_log(struct sk_buff *skb, const struct nf_conn *ct, const char *fmt, ...) { const struct nf_conn_help *help; const struct nf_conntrack_helper *helper; struct va_format vaf; va_list args; va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; /* Called from the helper function, this call never fails */ help = nfct_help(ct); /* rcu_read_lock()ed by nf_hook_thresh */ helper = rcu_dereference(help->helper); nf_log_packet(nf_ct_net(ct), nf_ct_l3num(ct), 0, skb, NULL, NULL, NULL, "nf_ct_%s: dropping packet: %pV ", helper->name, &vaf); va_end(args); } EXPORT_SYMBOL_GPL(nf_ct_helper_log); int nf_conntrack_helper_register(struct nf_conntrack_helper *me) { struct nf_conntrack_tuple_mask mask = { .src.u.all = htons(0xFFFF) }; unsigned int h = helper_hash(&me->tuple); struct nf_conntrack_helper *cur; int ret = 0, i; BUG_ON(me->expect_policy == NULL); BUG_ON(me->expect_class_max >= NF_CT_MAX_EXPECT_CLASSES); BUG_ON(strlen(me->name) > NF_CT_HELPER_NAME_LEN - 1); if (!nf_ct_helper_hash) return -ENOENT; if (me->expect_policy->max_expected > NF_CT_EXPECT_MAX_CNT) return -EINVAL; mutex_lock(&nf_ct_helper_mutex); for (i = 0; i < nf_ct_helper_hsize; i++) { hlist_for_each_entry(cur, &nf_ct_helper_hash[i], hnode) { if (!strcmp(cur->name, me->name) && (cur->tuple.src.l3num == NFPROTO_UNSPEC || cur->tuple.src.l3num == me->tuple.src.l3num) && cur->tuple.dst.protonum == me->tuple.dst.protonum) { ret = -EBUSY; goto out; } } } /* avoid unpredictable behaviour for auto_assign_helper */ if (!(me->flags & NF_CT_HELPER_F_USERSPACE)) { hlist_for_each_entry(cur, &nf_ct_helper_hash[h], hnode) { if (nf_ct_tuple_src_mask_cmp(&cur->tuple, &me->tuple, &mask)) { ret = -EBUSY; goto out; } } } refcount_set(&me->refcnt, 1); hlist_add_head_rcu(&me->hnode, &nf_ct_helper_hash[h]); nf_ct_helper_count++; out: mutex_unlock(&nf_ct_helper_mutex); return ret; } EXPORT_SYMBOL_GPL(nf_conntrack_helper_register); static bool expect_iter_me(struct nf_conntrack_expect *exp, void *data) { struct nf_conn_help *help = nfct_help(exp->master); const struct nf_conntrack_helper *me = data; const struct nf_conntrack_helper *this; if (exp->helper == me) return true; this = rcu_dereference_protected(help->helper, lockdep_is_held(&nf_conntrack_expect_lock)); return this == me; } void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me) { mutex_lock(&nf_ct_helper_mutex); hlist_del_rcu(&me->hnode); nf_ct_helper_count--; mutex_unlock(&nf_ct_helper_mutex); /* Make sure every nothing is still using the helper unless its a * connection in the hash. */ synchronize_rcu(); nf_ct_expect_iterate_destroy(expect_iter_me, NULL); nf_ct_iterate_destroy(unhelp, me); } EXPORT_SYMBOL_GPL(nf_conntrack_helper_unregister); void nf_ct_helper_init(struct nf_conntrack_helper *helper, u16 l3num, u16 protonum, const char *name, u16 default_port, u16 spec_port, u32 id, const struct nf_conntrack_expect_policy *exp_pol, u32 expect_class_max, int (*help)(struct sk_buff *skb, unsigned int protoff, struct nf_conn *ct, enum ip_conntrack_info ctinfo), int (*from_nlattr)(struct nlattr *attr, struct nf_conn *ct), struct module *module) { helper->tuple.src.l3num = l3num; helper->tuple.dst.protonum = protonum; helper->tuple.src.u.all = htons(spec_port); helper->expect_policy = exp_pol; helper->expect_class_max = expect_class_max; helper->help = help; helper->from_nlattr = from_nlattr; helper->me = module; snprintf(helper->nat_mod_name, sizeof(helper->nat_mod_name), NF_NAT_HELPER_PREFIX "%s", name); if (spec_port == default_port) snprintf(helper->name, sizeof(helper->name), "%s", name); else snprintf(helper->name, sizeof(helper->name), "%s-%u", name, id); } EXPORT_SYMBOL_GPL(nf_ct_helper_init); int nf_conntrack_helpers_register(struct nf_conntrack_helper *helper, unsigned int n) { unsigned int i; int err = 0; for (i = 0; i < n; i++) { err = nf_conntrack_helper_register(&helper[i]); if (err < 0) goto err; } return err; err: if (i > 0) nf_conntrack_helpers_unregister(helper, i); return err; } EXPORT_SYMBOL_GPL(nf_conntrack_helpers_register); void nf_conntrack_helpers_unregister(struct nf_conntrack_helper *helper, unsigned int n) { while (n-- > 0) nf_conntrack_helper_unregister(&helper[n]); } EXPORT_SYMBOL_GPL(nf_conntrack_helpers_unregister); void nf_nat_helper_register(struct nf_conntrack_nat_helper *nat) { mutex_lock(&nf_ct_nat_helpers_mutex); list_add_rcu(&nat->list, &nf_ct_nat_helpers); mutex_unlock(&nf_ct_nat_helpers_mutex); } EXPORT_SYMBOL_GPL(nf_nat_helper_register); void nf_nat_helper_unregister(struct nf_conntrack_nat_helper *nat) { mutex_lock(&nf_ct_nat_helpers_mutex); list_del_rcu(&nat->list); mutex_unlock(&nf_ct_nat_helpers_mutex); } EXPORT_SYMBOL_GPL(nf_nat_helper_unregister); int nf_conntrack_helper_init(void) { nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, 0); if (!nf_ct_helper_hash) return -ENOMEM; INIT_LIST_HEAD(&nf_ct_nat_helpers); return 0; } void nf_conntrack_helper_fini(void) { kvfree(nf_ct_helper_hash); nf_ct_helper_hash = NULL; } |
| 42 85 260 41 41 11 11 217 218 124 14 1 11 1 1 3 4 4 2 2 377 3 3 339 41 491 388 104 104 387 495 6 6 3 3 22 3 19 486 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 | /* License: GPL */ #include <linux/filter.h> #include <linux/mutex.h> #include <linux/socket.h> #include <linux/skbuff.h> #include <net/netlink.h> #include <net/net_namespace.h> #include <linux/module.h> #include <net/sock.h> #include <linux/kernel.h> #include <linux/tcp.h> #include <linux/workqueue.h> #include <linux/nospec.h> #include <linux/cookie.h> #include <linux/inet_diag.h> #include <linux/sock_diag.h> static const struct sock_diag_handler __rcu *sock_diag_handlers[AF_MAX]; static const struct sock_diag_inet_compat __rcu *inet_rcv_compat; static struct workqueue_struct *broadcast_wq; DEFINE_COOKIE(sock_cookie); u64 __sock_gen_cookie(struct sock *sk) { u64 res = atomic64_read(&sk->sk_cookie); if (!res) { u64 new = gen_cookie_next(&sock_cookie); atomic64_cmpxchg(&sk->sk_cookie, res, new); /* Another thread might have changed sk_cookie before us. */ res = atomic64_read(&sk->sk_cookie); } return res; } int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie) { u64 res; if (cookie[0] == INET_DIAG_NOCOOKIE && cookie[1] == INET_DIAG_NOCOOKIE) return 0; res = sock_gen_cookie(sk); if ((u32)res != cookie[0] || (u32)(res >> 32) != cookie[1]) return -ESTALE; return 0; } EXPORT_SYMBOL_GPL(sock_diag_check_cookie); void sock_diag_save_cookie(struct sock *sk, __u32 *cookie) { u64 res = sock_gen_cookie(sk); cookie[0] = (u32)res; cookie[1] = (u32)(res >> 32); } EXPORT_SYMBOL_GPL(sock_diag_save_cookie); int sock_diag_put_meminfo(struct sock *sk, struct sk_buff *skb, int attrtype) { u32 mem[SK_MEMINFO_VARS]; sk_get_meminfo(sk, mem); return nla_put(skb, attrtype, sizeof(mem), &mem); } EXPORT_SYMBOL_GPL(sock_diag_put_meminfo); int sock_diag_put_filterinfo(bool may_report_filterinfo, struct sock *sk, struct sk_buff *skb, int attrtype) { struct sock_fprog_kern *fprog; struct sk_filter *filter; struct nlattr *attr; unsigned int flen; int err = 0; if (!may_report_filterinfo) { nla_reserve(skb, attrtype, 0); return 0; } rcu_read_lock(); filter = rcu_dereference(sk->sk_filter); if (!filter) goto out; fprog = filter->prog->orig_prog; if (!fprog) goto out; flen = bpf_classic_proglen(fprog); attr = nla_reserve(skb, attrtype, flen); if (attr == NULL) { err = -EMSGSIZE; goto out; } memcpy(nla_data(attr), fprog->filter, flen); out: rcu_read_unlock(); return err; } EXPORT_SYMBOL(sock_diag_put_filterinfo); struct broadcast_sk { struct sock *sk; struct work_struct work; }; static size_t sock_diag_nlmsg_size(void) { return NLMSG_ALIGN(sizeof(struct inet_diag_msg) + nla_total_size(sizeof(u8)) /* INET_DIAG_PROTOCOL */ + nla_total_size_64bit(sizeof(struct tcp_info))); /* INET_DIAG_INFO */ } static const struct sock_diag_handler *sock_diag_lock_handler(int family) { const struct sock_diag_handler *handler; rcu_read_lock(); handler = rcu_dereference(sock_diag_handlers[family]); if (handler && !try_module_get(handler->owner)) handler = NULL; rcu_read_unlock(); return handler; } static void sock_diag_unlock_handler(const struct sock_diag_handler *handler) { module_put(handler->owner); } static void sock_diag_broadcast_destroy_work(struct work_struct *work) { struct broadcast_sk *bsk = container_of(work, struct broadcast_sk, work); struct sock *sk = bsk->sk; const struct sock_diag_handler *hndl; struct sk_buff *skb; const enum sknetlink_groups group = sock_diag_destroy_group(sk); int err = -1; WARN_ON(group == SKNLGRP_NONE); skb = nlmsg_new(sock_diag_nlmsg_size(), GFP_KERNEL); if (!skb) goto out; hndl = sock_diag_lock_handler(sk->sk_family); if (hndl) { if (hndl->get_info) err = hndl->get_info(skb, sk); sock_diag_unlock_handler(hndl); } if (!err) nlmsg_multicast(sock_net(sk)->diag_nlsk, skb, 0, group, GFP_KERNEL); else kfree_skb(skb); out: sk_destruct(sk); kfree(bsk); } void sock_diag_broadcast_destroy(struct sock *sk) { /* Note, this function is often called from an interrupt context. */ struct broadcast_sk *bsk = kmalloc(sizeof(struct broadcast_sk), GFP_ATOMIC); if (!bsk) return sk_destruct(sk); bsk->sk = sk; INIT_WORK(&bsk->work, sock_diag_broadcast_destroy_work); queue_work(broadcast_wq, &bsk->work); } void sock_diag_register_inet_compat(const struct sock_diag_inet_compat *ptr) { xchg(&inet_rcv_compat, RCU_INITIALIZER(ptr)); } EXPORT_SYMBOL_GPL(sock_diag_register_inet_compat); void sock_diag_unregister_inet_compat(const struct sock_diag_inet_compat *ptr) { const struct sock_diag_inet_compat *old; old = unrcu_pointer(xchg(&inet_rcv_compat, NULL)); WARN_ON_ONCE(old != ptr); } EXPORT_SYMBOL_GPL(sock_diag_unregister_inet_compat); int sock_diag_register(const struct sock_diag_handler *hndl) { int family = hndl->family; if (family >= AF_MAX) return -EINVAL; return !cmpxchg((const struct sock_diag_handler **) &sock_diag_handlers[family], NULL, hndl) ? 0 : -EBUSY; } EXPORT_SYMBOL_GPL(sock_diag_register); void sock_diag_unregister(const struct sock_diag_handler *hndl) { int family = hndl->family; if (family >= AF_MAX) return; xchg((const struct sock_diag_handler **)&sock_diag_handlers[family], NULL); } EXPORT_SYMBOL_GPL(sock_diag_unregister); static int __sock_diag_cmd(struct sk_buff *skb, struct nlmsghdr *nlh) { int err; struct sock_diag_req *req = nlmsg_data(nlh); const struct sock_diag_handler *hndl; if (nlmsg_len(nlh) < sizeof(*req)) return -EINVAL; if (req->sdiag_family >= AF_MAX) return -EINVAL; req->sdiag_family = array_index_nospec(req->sdiag_family, AF_MAX); if (!rcu_access_pointer(sock_diag_handlers[req->sdiag_family])) sock_load_diag_module(req->sdiag_family, 0); hndl = sock_diag_lock_handler(req->sdiag_family); if (hndl == NULL) return -ENOENT; if (nlh->nlmsg_type == SOCK_DIAG_BY_FAMILY) err = hndl->dump(skb, nlh); else if (nlh->nlmsg_type == SOCK_DESTROY && hndl->destroy) err = hndl->destroy(skb, nlh); else err = -EOPNOTSUPP; sock_diag_unlock_handler(hndl); return err; } static int sock_diag_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { const struct sock_diag_inet_compat *ptr; int ret; switch (nlh->nlmsg_type) { case TCPDIAG_GETSOCK: if (!rcu_access_pointer(inet_rcv_compat)) sock_load_diag_module(AF_INET, 0); rcu_read_lock(); ptr = rcu_dereference(inet_rcv_compat); if (ptr && !try_module_get(ptr->owner)) ptr = NULL; rcu_read_unlock(); ret = -EOPNOTSUPP; if (ptr) { ret = ptr->fn(skb, nlh); module_put(ptr->owner); } return ret; case SOCK_DIAG_BY_FAMILY: case SOCK_DESTROY: return __sock_diag_cmd(skb, nlh); default: return -EINVAL; } } static void sock_diag_rcv(struct sk_buff *skb) { netlink_rcv_skb(skb, &sock_diag_rcv_msg); } static int sock_diag_bind(struct net *net, int group) { switch (group) { case SKNLGRP_INET_TCP_DESTROY: case SKNLGRP_INET_UDP_DESTROY: if (!rcu_access_pointer(sock_diag_handlers[AF_INET])) sock_load_diag_module(AF_INET, 0); break; case SKNLGRP_INET6_TCP_DESTROY: case SKNLGRP_INET6_UDP_DESTROY: if (!rcu_access_pointer(sock_diag_handlers[AF_INET6])) sock_load_diag_module(AF_INET6, 0); break; } return 0; } int sock_diag_destroy(struct sock *sk, int err) { if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; if (!sk->sk_prot->diag_destroy) return -EOPNOTSUPP; return sk->sk_prot->diag_destroy(sk, err); } EXPORT_SYMBOL_GPL(sock_diag_destroy); static int __net_init diag_net_init(struct net *net) { struct netlink_kernel_cfg cfg = { .groups = SKNLGRP_MAX, .input = sock_diag_rcv, .bind = sock_diag_bind, .flags = NL_CFG_F_NONROOT_RECV, }; net->diag_nlsk = netlink_kernel_create(net, NETLINK_SOCK_DIAG, &cfg); return net->diag_nlsk == NULL ? -ENOMEM : 0; } static void __net_exit diag_net_exit(struct net *net) { netlink_kernel_release(net->diag_nlsk); net->diag_nlsk = NULL; } static struct pernet_operations diag_net_ops = { .init = diag_net_init, .exit = diag_net_exit, }; static int __init sock_diag_init(void) { broadcast_wq = alloc_workqueue("sock_diag_events", 0, 0); BUG_ON(!broadcast_wq); return register_pernet_subsys(&diag_net_ops); } device_initcall(sock_diag_init); |
| 25 461 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 | /* SPDX-License-Identifier: GPL-2.0 */ /* * connection tracking helpers. * * 16 Dec 2003: Yasuyuki Kozakai @USAGI <yasuyuki.kozakai@toshiba.co.jp> * - generalize L3 protocol dependent part. * * Derived from include/linux/netfiter_ipv4/ip_conntrack_helper.h */ #ifndef _NF_CONNTRACK_HELPER_H #define _NF_CONNTRACK_HELPER_H #include <linux/refcount.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_extend.h> #include <net/netfilter/nf_conntrack_expect.h> #define NF_NAT_HELPER_PREFIX "ip_nat_" #define NF_NAT_HELPER_NAME(name) NF_NAT_HELPER_PREFIX name #define MODULE_ALIAS_NF_NAT_HELPER(name) \ MODULE_ALIAS(NF_NAT_HELPER_NAME(name)) struct module; enum nf_ct_helper_flags { NF_CT_HELPER_F_USERSPACE = (1 << 0), NF_CT_HELPER_F_CONFIGURED = (1 << 1), }; #define NF_CT_HELPER_NAME_LEN 16 struct nf_conntrack_helper { struct hlist_node hnode; /* Internal use. */ char name[NF_CT_HELPER_NAME_LEN]; /* name of the module */ refcount_t refcnt; struct module *me; /* pointer to self */ const struct nf_conntrack_expect_policy *expect_policy; /* Tuple of things we will help (compared against server response) */ struct nf_conntrack_tuple tuple; /* Function to call when data passes; return verdict, or -1 to invalidate. */ int (*help)(struct sk_buff *skb, unsigned int protoff, struct nf_conn *ct, enum ip_conntrack_info conntrackinfo); void (*destroy)(struct nf_conn *ct); int (*from_nlattr)(struct nlattr *attr, struct nf_conn *ct); int (*to_nlattr)(struct sk_buff *skb, const struct nf_conn *ct); unsigned int expect_class_max; unsigned int flags; /* For user-space helpers: */ unsigned int queue_num; /* length of userspace private data stored in nf_conn_help->data */ u16 data_len; /* name of NAT helper module */ char nat_mod_name[NF_CT_HELPER_NAME_LEN]; }; /* Must be kept in sync with the classes defined by helpers */ #define NF_CT_MAX_EXPECT_CLASSES 4 /* nf_conn feature for connections that have a helper */ struct nf_conn_help { /* Helper. if any */ struct nf_conntrack_helper __rcu *helper; struct hlist_head expectations; /* Current number of expected connections */ u8 expecting[NF_CT_MAX_EXPECT_CLASSES]; /* private helper information. */ char data[32] __aligned(8); }; #define NF_CT_HELPER_BUILD_BUG_ON(structsize) \ BUILD_BUG_ON((structsize) > sizeof_field(struct nf_conn_help, data)) struct nf_conntrack_helper *__nf_conntrack_helper_find(const char *name, u16 l3num, u8 protonum); struct nf_conntrack_helper *nf_conntrack_helper_try_module_get(const char *name, u16 l3num, u8 protonum); void nf_conntrack_helper_put(struct nf_conntrack_helper *helper); void nf_ct_helper_init(struct nf_conntrack_helper *helper, u16 l3num, u16 protonum, const char *name, u16 default_port, u16 spec_port, u32 id, const struct nf_conntrack_expect_policy *exp_pol, u32 expect_class_max, int (*help)(struct sk_buff *skb, unsigned int protoff, struct nf_conn *ct, enum ip_conntrack_info ctinfo), int (*from_nlattr)(struct nlattr *attr, struct nf_conn *ct), struct module *module); int nf_conntrack_helper_register(struct nf_conntrack_helper *); void nf_conntrack_helper_unregister(struct nf_conntrack_helper *); int nf_conntrack_helpers_register(struct nf_conntrack_helper *, unsigned int); void nf_conntrack_helpers_unregister(struct nf_conntrack_helper *, unsigned int); struct nf_conn_help *nf_ct_helper_ext_add(struct nf_conn *ct, gfp_t gfp); int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl, gfp_t flags); int nf_ct_helper(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, u16 proto); int nf_ct_add_helper(struct nf_conn *ct, const char *name, u8 family, u8 proto, bool nat, struct nf_conntrack_helper **hp); void nf_ct_helper_destroy(struct nf_conn *ct); static inline struct nf_conn_help *nfct_help(const struct nf_conn *ct) { return nf_ct_ext_find(ct, NF_CT_EXT_HELPER); } static inline void *nfct_help_data(const struct nf_conn *ct) { struct nf_conn_help *help; help = nf_ct_ext_find(ct, NF_CT_EXT_HELPER); return (void *)help->data; } int nf_conntrack_helper_init(void); void nf_conntrack_helper_fini(void); int nf_conntrack_broadcast_help(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, unsigned int timeout); struct nf_ct_helper_expectfn { struct list_head head; const char *name; void (*expectfn)(struct nf_conn *ct, struct nf_conntrack_expect *exp); }; __printf(3,4) void nf_ct_helper_log(struct sk_buff *skb, const struct nf_conn *ct, const char *fmt, ...); void nf_ct_helper_expectfn_register(struct nf_ct_helper_expectfn *n); void nf_ct_helper_expectfn_unregister(struct nf_ct_helper_expectfn *n); struct nf_ct_helper_expectfn * nf_ct_helper_expectfn_find_by_name(const char *name); struct nf_ct_helper_expectfn * nf_ct_helper_expectfn_find_by_symbol(const void *symbol); extern struct hlist_head *nf_ct_helper_hash; extern unsigned int nf_ct_helper_hsize; struct nf_conntrack_nat_helper { struct list_head list; char mod_name[NF_CT_HELPER_NAME_LEN]; /* module name */ struct module *module; /* pointer to self */ }; #define NF_CT_NAT_HELPER_INIT(name) \ { \ .mod_name = NF_NAT_HELPER_NAME(name), \ .module = THIS_MODULE \ } void nf_nat_helper_register(struct nf_conntrack_nat_helper *nat); void nf_nat_helper_unregister(struct nf_conntrack_nat_helper *nat); int nf_nat_helper_try_module_get(const char *name, u16 l3num, u8 protonum); void nf_nat_helper_put(struct nf_conntrack_helper *helper); #endif /*_NF_CONNTRACK_HELPER_H*/ |
| 1 1 1 2 1 1 1 1 1 1 14 34 7 2 1 4 38 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 | // SPDX-License-Identifier: GPL-2.0+ /* * Copyright (C) 2003-2008 Takahiro Hirofuchi */ #include <linux/kthread.h> #include <linux/slab.h> #include "usbip_common.h" #include "vhci.h" /* get URB from transmitted urb queue. caller must hold vdev->priv_lock */ struct urb *pickup_urb_and_free_priv(struct vhci_device *vdev, __u32 seqnum) { struct vhci_priv *priv, *tmp; struct urb *urb = NULL; int status; list_for_each_entry_safe(priv, tmp, &vdev->priv_rx, list) { if (priv->seqnum != seqnum) continue; urb = priv->urb; status = urb->status; usbip_dbg_vhci_rx("find urb seqnum %u\n", seqnum); switch (status) { case -ENOENT: fallthrough; case -ECONNRESET: dev_dbg(&urb->dev->dev, "urb seq# %u was unlinked %ssynchronously\n", seqnum, status == -ENOENT ? "" : "a"); break; case -EINPROGRESS: /* no info output */ break; default: dev_dbg(&urb->dev->dev, "urb seq# %u may be in a error, status %d\n", seqnum, status); } list_del(&priv->list); kfree(priv); urb->hcpriv = NULL; break; } return urb; } static void vhci_recv_ret_submit(struct vhci_device *vdev, struct usbip_header *pdu) { struct vhci_hcd *vhci_hcd = vdev_to_vhci_hcd(vdev); struct vhci *vhci = vhci_hcd->vhci; struct usbip_device *ud = &vdev->ud; struct urb *urb; unsigned long flags; spin_lock_irqsave(&vdev->priv_lock, flags); urb = pickup_urb_and_free_priv(vdev, pdu->base.seqnum); spin_unlock_irqrestore(&vdev->priv_lock, flags); if (!urb) { pr_err("cannot find a urb of seqnum %u max seqnum %u\n", pdu->base.seqnum, atomic_read(&vhci_hcd->seqnum)); usbip_event_add(ud, VDEV_EVENT_ERROR_TCP); return; } /* unpack the pdu to a urb */ usbip_pack_pdu(pdu, urb, USBIP_RET_SUBMIT, 0); /* recv transfer buffer */ if (usbip_recv_xbuff(ud, urb) < 0) { urb->status = -EPROTO; goto error; } /* recv iso_packet_descriptor */ if (usbip_recv_iso(ud, urb) < 0) { urb->status = -EPROTO; goto error; } /* restore the padding in iso packets */ usbip_pad_iso(ud, urb); error: if (usbip_dbg_flag_vhci_rx) usbip_dump_urb(urb); if (urb->num_sgs) urb->transfer_flags &= ~URB_DMA_MAP_SG; usbip_dbg_vhci_rx("now giveback urb %u\n", pdu->base.seqnum); spin_lock_irqsave(&vhci->lock, flags); usb_hcd_unlink_urb_from_ep(vhci_hcd_to_hcd(vhci_hcd), urb); spin_unlock_irqrestore(&vhci->lock, flags); usb_hcd_giveback_urb(vhci_hcd_to_hcd(vhci_hcd), urb, urb->status); usbip_dbg_vhci_rx("Leave\n"); } static struct vhci_unlink *dequeue_pending_unlink(struct vhci_device *vdev, struct usbip_header *pdu) { struct vhci_unlink *unlink, *tmp; unsigned long flags; spin_lock_irqsave(&vdev->priv_lock, flags); list_for_each_entry_safe(unlink, tmp, &vdev->unlink_rx, list) { pr_info("unlink->seqnum %lu\n", unlink->seqnum); if (unlink->seqnum == pdu->base.seqnum) { usbip_dbg_vhci_rx("found pending unlink, %lu\n", unlink->seqnum); list_del(&unlink->list); spin_unlock_irqrestore(&vdev->priv_lock, flags); return unlink; } } spin_unlock_irqrestore(&vdev->priv_lock, flags); return NULL; } static void vhci_recv_ret_unlink(struct vhci_device *vdev, struct usbip_header *pdu) { struct vhci_hcd *vhci_hcd = vdev_to_vhci_hcd(vdev); struct vhci *vhci = vhci_hcd->vhci; struct vhci_unlink *unlink; struct urb *urb; unsigned long flags; usbip_dump_header(pdu); unlink = dequeue_pending_unlink(vdev, pdu); if (!unlink) { pr_info("cannot find the pending unlink %u\n", pdu->base.seqnum); return; } spin_lock_irqsave(&vdev->priv_lock, flags); urb = pickup_urb_and_free_priv(vdev, unlink->unlink_seqnum); spin_unlock_irqrestore(&vdev->priv_lock, flags); if (!urb) { /* * I get the result of a unlink request. But, it seems that I * already received the result of its submit result and gave * back the URB. */ pr_info("the urb (seqnum %u) was already given back\n", pdu->base.seqnum); } else { usbip_dbg_vhci_rx("now giveback urb %u\n", pdu->base.seqnum); /* If unlink is successful, status is -ECONNRESET */ urb->status = pdu->u.ret_unlink.status; pr_info("urb->status %d\n", urb->status); spin_lock_irqsave(&vhci->lock, flags); usb_hcd_unlink_urb_from_ep(vhci_hcd_to_hcd(vhci_hcd), urb); spin_unlock_irqrestore(&vhci->lock, flags); usb_hcd_giveback_urb(vhci_hcd_to_hcd(vhci_hcd), urb, urb->status); } kfree(unlink); } static int vhci_priv_tx_empty(struct vhci_device *vdev) { int empty = 0; unsigned long flags; spin_lock_irqsave(&vdev->priv_lock, flags); empty = list_empty(&vdev->priv_rx); spin_unlock_irqrestore(&vdev->priv_lock, flags); return empty; } /* recv a pdu */ static void vhci_rx_pdu(struct usbip_device *ud) { int ret; struct usbip_header pdu; struct vhci_device *vdev = container_of(ud, struct vhci_device, ud); usbip_dbg_vhci_rx("Enter\n"); memset(&pdu, 0, sizeof(pdu)); /* receive a pdu header */ ret = usbip_recv(ud->tcp_socket, &pdu, sizeof(pdu)); if (ret < 0) { if (ret == -ECONNRESET) pr_info("connection reset by peer\n"); else if (ret == -EAGAIN) { /* ignore if connection was idle */ if (vhci_priv_tx_empty(vdev)) return; pr_info("connection timed out with pending urbs\n"); } else if (ret != -ERESTARTSYS) pr_info("xmit failed %d\n", ret); usbip_event_add(ud, VDEV_EVENT_ERROR_TCP); return; } if (ret == 0) { pr_info("connection closed"); usbip_event_add(ud, VDEV_EVENT_DOWN); return; } if (ret != sizeof(pdu)) { pr_err("received pdu size is %d, should be %d\n", ret, (unsigned int)sizeof(pdu)); usbip_event_add(ud, VDEV_EVENT_ERROR_TCP); return; } usbip_header_correct_endian(&pdu, 0); if (usbip_dbg_flag_vhci_rx) usbip_dump_header(&pdu); switch (pdu.base.command) { case USBIP_RET_SUBMIT: vhci_recv_ret_submit(vdev, &pdu); break; case USBIP_RET_UNLINK: vhci_recv_ret_unlink(vdev, &pdu); break; default: /* NOT REACHED */ pr_err("unknown pdu %u\n", pdu.base.command); usbip_dump_header(&pdu); usbip_event_add(ud, VDEV_EVENT_ERROR_TCP); break; } } int vhci_rx_loop(void *data) { struct usbip_device *ud = data; while (!kthread_should_stop()) { if (usbip_event_happened(ud)) break; usbip_kcov_remote_start(ud); vhci_rx_pdu(ud); usbip_kcov_remote_stop(); } return 0; } |
| 19 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_IF_MACVLAN_H #define _LINUX_IF_MACVLAN_H #include <linux/if_link.h> #include <linux/if_vlan.h> #include <linux/list.h> #include <linux/netdevice.h> #include <linux/netlink.h> #include <net/netlink.h> #include <linux/u64_stats_sync.h> struct macvlan_port; #define MACVLAN_MC_FILTER_BITS 8 #define MACVLAN_MC_FILTER_SZ (1 << MACVLAN_MC_FILTER_BITS) struct macvlan_dev { struct net_device *dev; struct list_head list; struct hlist_node hlist; struct macvlan_port *port; struct net_device *lowerdev; netdevice_tracker dev_tracker; void *accel_priv; struct vlan_pcpu_stats __percpu *pcpu_stats; DECLARE_BITMAP(mc_filter, MACVLAN_MC_FILTER_SZ); netdev_features_t set_features; enum macvlan_mode mode; u16 flags; unsigned int macaddr_count; u32 bc_queue_len_req; #ifdef CONFIG_NET_POLL_CONTROLLER struct netpoll *netpoll; #endif }; static inline void macvlan_count_rx(const struct macvlan_dev *vlan, unsigned int len, bool success, bool multicast) { if (likely(success)) { struct vlan_pcpu_stats *pcpu_stats; pcpu_stats = get_cpu_ptr(vlan->pcpu_stats); u64_stats_update_begin(&pcpu_stats->syncp); u64_stats_inc(&pcpu_stats->rx_packets); u64_stats_add(&pcpu_stats->rx_bytes, len); if (multicast) u64_stats_inc(&pcpu_stats->rx_multicast); u64_stats_update_end(&pcpu_stats->syncp); put_cpu_ptr(vlan->pcpu_stats); } else { this_cpu_inc(vlan->pcpu_stats->rx_errors); } } extern void macvlan_common_setup(struct net_device *dev); struct rtnl_newlink_params; extern int macvlan_common_newlink(struct net_device *dev, struct rtnl_newlink_params *params, struct netlink_ext_ack *extack); extern void macvlan_dellink(struct net_device *dev, struct list_head *head); extern int macvlan_link_register(struct rtnl_link_ops *ops); #if IS_ENABLED(CONFIG_MACVLAN) static inline struct net_device * macvlan_dev_real_dev(const struct net_device *dev) { struct macvlan_dev *macvlan = netdev_priv(dev); return macvlan->lowerdev; } #else static inline struct net_device * macvlan_dev_real_dev(const struct net_device *dev) { BUG(); return NULL; } #endif static inline void *macvlan_accel_priv(struct net_device *dev) { struct macvlan_dev *macvlan = netdev_priv(dev); return macvlan->accel_priv; } static inline bool macvlan_supports_dest_filter(struct net_device *dev) { struct macvlan_dev *macvlan = netdev_priv(dev); return macvlan->mode == MACVLAN_MODE_PRIVATE || macvlan->mode == MACVLAN_MODE_VEPA || macvlan->mode == MACVLAN_MODE_BRIDGE; } static inline int macvlan_release_l2fw_offload(struct net_device *dev) { struct macvlan_dev *macvlan = netdev_priv(dev); macvlan->accel_priv = NULL; return dev_uc_add(macvlan->lowerdev, dev->dev_addr); } #endif /* _LINUX_IF_MACVLAN_H */ |
| 174 124 109 1 39 6 39 6 40 5 11 34 2 43 43 2 48 4 44 2 2 2 502 608 502 328 653 50 50 259 135 153 6 271 20 259 112 654 494 654 6134 6323 85 18 68 273 272 85 85 2 17 16 10 44 1 19 26 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 | // SPDX-License-Identifier: GPL-2.0-only /* * fs/kernfs/inode.c - kernfs inode implementation * * Copyright (c) 2001-3 Patrick Mochel * Copyright (c) 2007 SUSE Linux Products GmbH * Copyright (c) 2007, 2013 Tejun Heo <tj@kernel.org> */ #include <linux/pagemap.h> #include <linux/backing-dev.h> #include <linux/capability.h> #include <linux/errno.h> #include <linux/slab.h> #include <linux/xattr.h> #include <linux/security.h> #include "kernfs-internal.h" static const struct inode_operations kernfs_iops = { .permission = kernfs_iop_permission, .setattr = kernfs_iop_setattr, .getattr = kernfs_iop_getattr, .listxattr = kernfs_iop_listxattr, }; static struct kernfs_iattrs *__kernfs_iattrs(struct kernfs_node *kn, bool alloc) { struct kernfs_iattrs *ret __free(kfree) = NULL; struct kernfs_iattrs *attr; attr = READ_ONCE(kn->iattr); if (attr || !alloc) return attr; ret = kmem_cache_zalloc(kernfs_iattrs_cache, GFP_KERNEL); if (!ret) return NULL; /* assign default attributes */ ret->ia_uid = GLOBAL_ROOT_UID; ret->ia_gid = GLOBAL_ROOT_GID; ktime_get_real_ts64(&ret->ia_atime); ret->ia_mtime = ret->ia_atime; ret->ia_ctime = ret->ia_atime; simple_xattrs_init(&ret->xattrs); atomic_set(&ret->nr_user_xattrs, 0); atomic_set(&ret->user_xattr_size, 0); /* If someone raced us, recognize it. */ if (!try_cmpxchg(&kn->iattr, &attr, ret)) return READ_ONCE(kn->iattr); return no_free_ptr(ret); } static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn) { return __kernfs_iattrs(kn, true); } static struct kernfs_iattrs *kernfs_iattrs_noalloc(struct kernfs_node *kn) { return __kernfs_iattrs(kn, false); } int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) { struct kernfs_iattrs *attrs; unsigned int ia_valid = iattr->ia_valid; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; if (ia_valid & ATTR_UID) attrs->ia_uid = iattr->ia_uid; if (ia_valid & ATTR_GID) attrs->ia_gid = iattr->ia_gid; if (ia_valid & ATTR_ATIME) attrs->ia_atime = iattr->ia_atime; if (ia_valid & ATTR_MTIME) attrs->ia_mtime = iattr->ia_mtime; if (ia_valid & ATTR_CTIME) attrs->ia_ctime = iattr->ia_ctime; if (ia_valid & ATTR_MODE) kn->mode = iattr->ia_mode; return 0; } /** * kernfs_setattr - set iattr on a node * @kn: target node * @iattr: iattr to set * * Return: %0 on success, -errno on failure. */ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) { int ret; struct kernfs_root *root = kernfs_root(kn); down_write(&root->kernfs_iattr_rwsem); ret = __kernfs_setattr(kn, iattr); up_write(&root->kernfs_iattr_rwsem); return ret; } int kernfs_iop_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *iattr) { struct inode *inode = d_inode(dentry); struct kernfs_node *kn = inode->i_private; struct kernfs_root *root; int error; if (!kn) return -EINVAL; root = kernfs_root(kn); down_write(&root->kernfs_iattr_rwsem); error = setattr_prepare(&nop_mnt_idmap, dentry, iattr); if (error) goto out; error = __kernfs_setattr(kn, iattr); if (error) goto out; /* this ignores size changes */ setattr_copy(&nop_mnt_idmap, inode, iattr); out: up_write(&root->kernfs_iattr_rwsem); return error; } ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size) { struct kernfs_node *kn = kernfs_dentry_node(dentry); struct kernfs_iattrs *attrs; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; return simple_xattr_list(d_inode(dentry), &attrs->xattrs, buf, size); } static inline void set_default_inode_attr(struct inode *inode, umode_t mode) { inode->i_mode = mode; simple_inode_init_ts(inode); } static inline void set_inode_attr(struct inode *inode, struct kernfs_iattrs *attrs) { inode->i_uid = attrs->ia_uid; inode->i_gid = attrs->ia_gid; inode_set_atime_to_ts(inode, attrs->ia_atime); inode_set_mtime_to_ts(inode, attrs->ia_mtime); inode_set_ctime_to_ts(inode, attrs->ia_ctime); } static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode) { struct kernfs_iattrs *attrs; inode->i_mode = kn->mode; attrs = kernfs_iattrs_noalloc(kn); if (attrs) /* * kernfs_node has non-default attributes get them from * persistent copy in kernfs_node. */ set_inode_attr(inode, attrs); if (kernfs_type(kn) == KERNFS_DIR) set_nlink(inode, kn->dir.subdirs + 2); } int kernfs_iop_getattr(struct mnt_idmap *idmap, const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags) { struct inode *inode = d_inode(path->dentry); struct kernfs_node *kn = inode->i_private; struct kernfs_root *root = kernfs_root(kn); down_read(&root->kernfs_iattr_rwsem); kernfs_refresh_inode(kn, inode); generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat); up_read(&root->kernfs_iattr_rwsem); return 0; } static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode) { kernfs_get(kn); inode->i_private = kn; inode->i_mapping->a_ops = &ram_aops; inode->i_op = &kernfs_iops; inode->i_generation = kernfs_gen(kn); set_default_inode_attr(inode, kn->mode); kernfs_refresh_inode(kn, inode); /* initialize inode according to type */ switch (kernfs_type(kn)) { case KERNFS_DIR: inode->i_op = &kernfs_dir_iops; inode->i_fop = &kernfs_dir_fops; if (kn->flags & KERNFS_EMPTY_DIR) make_empty_dir_inode(inode); break; case KERNFS_FILE: inode->i_size = kn->attr.size; inode->i_fop = &kernfs_file_fops; break; case KERNFS_LINK: inode->i_op = &kernfs_symlink_iops; break; default: BUG(); } unlock_new_inode(inode); } /** * kernfs_get_inode - get inode for kernfs_node * @sb: super block * @kn: kernfs_node to allocate inode for * * Get inode for @kn. If such inode doesn't exist, a new inode is * allocated and basics are initialized. New inode is returned * locked. * * Locking: * Kernel thread context (may sleep). * * Return: * Pointer to allocated inode on success, %NULL on failure. */ struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn) { struct inode *inode; inode = iget_locked(sb, kernfs_ino(kn)); if (inode && (inode->i_state & I_NEW)) kernfs_init_inode(kn, inode); return inode; } /* * The kernfs_node serves as both an inode and a directory entry for * kernfs. To prevent the kernfs inode numbers from being freed * prematurely we take a reference to kernfs_node from the kernfs inode. A * super_operations.evict_inode() implementation is needed to drop that * reference upon inode destruction. */ void kernfs_evict_inode(struct inode *inode) { struct kernfs_node *kn = inode->i_private; truncate_inode_pages_final(&inode->i_data); clear_inode(inode); kernfs_put(kn); } int kernfs_iop_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { struct kernfs_node *kn; struct kernfs_root *root; int ret; if (mask & MAY_NOT_BLOCK) return -ECHILD; kn = inode->i_private; root = kernfs_root(kn); down_read(&root->kernfs_iattr_rwsem); kernfs_refresh_inode(kn, inode); ret = generic_permission(&nop_mnt_idmap, inode, mask); up_read(&root->kernfs_iattr_rwsem); return ret; } int kernfs_xattr_get(struct kernfs_node *kn, const char *name, void *value, size_t size) { struct kernfs_iattrs *attrs = kernfs_iattrs_noalloc(kn); if (!attrs) return -ENODATA; return simple_xattr_get(&attrs->xattrs, name, value, size); } int kernfs_xattr_set(struct kernfs_node *kn, const char *name, const void *value, size_t size, int flags) { struct simple_xattr *old_xattr; struct kernfs_iattrs *attrs; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; old_xattr = simple_xattr_set(&attrs->xattrs, name, value, size, flags); if (IS_ERR(old_xattr)) return PTR_ERR(old_xattr); simple_xattr_free(old_xattr); return 0; } static int kernfs_vfs_xattr_get(const struct xattr_handler *handler, struct dentry *unused, struct inode *inode, const char *suffix, void *value, size_t size) { const char *name = xattr_full_name(handler, suffix); struct kernfs_node *kn = inode->i_private; return kernfs_xattr_get(kn, name, value, size); } static int kernfs_vfs_xattr_set(const struct xattr_handler *handler, struct mnt_idmap *idmap, struct dentry *unused, struct inode *inode, const char *suffix, const void *value, size_t size, int flags) { const char *name = xattr_full_name(handler, suffix); struct kernfs_node *kn = inode->i_private; return kernfs_xattr_set(kn, name, value, size, flags); } static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn, const char *full_name, struct simple_xattrs *xattrs, const void *value, size_t size, int flags) { struct kernfs_iattrs *attr = kernfs_iattrs_noalloc(kn); atomic_t *sz = &attr->user_xattr_size; atomic_t *nr = &attr->nr_user_xattrs; struct simple_xattr *old_xattr; int ret; if (atomic_inc_return(nr) > KERNFS_MAX_USER_XATTRS) { ret = -ENOSPC; goto dec_count_out; } if (atomic_add_return(size, sz) > KERNFS_USER_XATTR_SIZE_LIMIT) { ret = -ENOSPC; goto dec_size_out; } old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags); if (!old_xattr) return 0; if (IS_ERR(old_xattr)) { ret = PTR_ERR(old_xattr); goto dec_size_out; } ret = 0; size = old_xattr->size; simple_xattr_free(old_xattr); dec_size_out: atomic_sub(size, sz); dec_count_out: atomic_dec(nr); return ret; } static int kernfs_vfs_user_xattr_rm(struct kernfs_node *kn, const char *full_name, struct simple_xattrs *xattrs, const void *value, size_t size, int flags) { struct kernfs_iattrs *attr = kernfs_iattrs_noalloc(kn); atomic_t *sz = &attr->user_xattr_size; atomic_t *nr = &attr->nr_user_xattrs; struct simple_xattr *old_xattr; old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags); if (!old_xattr) return 0; if (IS_ERR(old_xattr)) return PTR_ERR(old_xattr); atomic_sub(old_xattr->size, sz); atomic_dec(nr); simple_xattr_free(old_xattr); return 0; } static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler, struct mnt_idmap *idmap, struct dentry *unused, struct inode *inode, const char *suffix, const void *value, size_t size, int flags) { const char *full_name = xattr_full_name(handler, suffix); struct kernfs_node *kn = inode->i_private; struct kernfs_iattrs *attrs; if (!(kernfs_root(kn)->flags & KERNFS_ROOT_SUPPORT_USER_XATTR)) return -EOPNOTSUPP; attrs = kernfs_iattrs(kn); if (!attrs) return -ENOMEM; if (value) return kernfs_vfs_user_xattr_add(kn, full_name, &attrs->xattrs, value, size, flags); else return kernfs_vfs_user_xattr_rm(kn, full_name, &attrs->xattrs, value, size, flags); } static const struct xattr_handler kernfs_trusted_xattr_handler = { .prefix = XATTR_TRUSTED_PREFIX, .get = kernfs_vfs_xattr_get, .set = kernfs_vfs_xattr_set, }; static const struct xattr_handler kernfs_security_xattr_handler = { .prefix = XATTR_SECURITY_PREFIX, .get = kernfs_vfs_xattr_get, .set = kernfs_vfs_xattr_set, }; static const struct xattr_handler kernfs_user_xattr_handler = { .prefix = XATTR_USER_PREFIX, .get = kernfs_vfs_xattr_get, .set = kernfs_vfs_user_xattr_set, }; const struct xattr_handler * const kernfs_xattr_handlers[] = { &kernfs_trusted_xattr_handler, &kernfs_security_xattr_handler, &kernfs_user_xattr_handler, NULL }; |
| 2 113 859 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_HIGHMEM_INTERNAL_H #define _LINUX_HIGHMEM_INTERNAL_H /* * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft. */ #ifdef CONFIG_KMAP_LOCAL void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot); void *__kmap_local_page_prot(struct page *page, pgprot_t prot); void kunmap_local_indexed(const void *vaddr); void kmap_local_fork(struct task_struct *tsk); void __kmap_local_sched_out(void); void __kmap_local_sched_in(void); static inline void kmap_assert_nomap(void) { DEBUG_LOCKS_WARN_ON(current->kmap_ctrl.idx); } #else static inline void kmap_local_fork(struct task_struct *tsk) { } static inline void kmap_assert_nomap(void) { } #endif #ifdef CONFIG_HIGHMEM #include <asm/highmem.h> #ifndef ARCH_HAS_KMAP_FLUSH_TLB static inline void kmap_flush_tlb(unsigned long addr) { } #endif #ifndef kmap_prot #define kmap_prot PAGE_KERNEL #endif void *kmap_high(struct page *page); void kunmap_high(struct page *page); void __kmap_flush_unused(void); struct page *__kmap_to_page(void *addr); static inline void *kmap(struct page *page) { void *addr; might_sleep(); if (!PageHighMem(page)) addr = page_address(page); else addr = kmap_high(page); kmap_flush_tlb((unsigned long)addr); return addr; } static inline void kunmap(struct page *page) { might_sleep(); if (!PageHighMem(page)) return; kunmap_high(page); } static inline struct page *kmap_to_page(void *addr) { return __kmap_to_page(addr); } static inline void kmap_flush_unused(void) { __kmap_flush_unused(); } static inline void *kmap_local_page(struct page *page) { return __kmap_local_page_prot(page, kmap_prot); } static inline void *kmap_local_page_try_from_panic(struct page *page) { if (!PageHighMem(page)) return page_address(page); /* If the page is in HighMem, it's not safe to kmap it.*/ return NULL; } static inline void *kmap_local_folio(struct folio *folio, size_t offset) { struct page *page = folio_page(folio, offset / PAGE_SIZE); return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; } static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot) { return __kmap_local_page_prot(page, prot); } static inline void *kmap_local_pfn(unsigned long pfn) { return __kmap_local_pfn_prot(pfn, kmap_prot); } static inline void __kunmap_local(const void *vaddr) { kunmap_local_indexed(vaddr); } static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) { if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_disable(); else preempt_disable(); pagefault_disable(); return __kmap_local_page_prot(page, prot); } static inline void *kmap_atomic(struct page *page) { return kmap_atomic_prot(page, kmap_prot); } static inline void *kmap_atomic_pfn(unsigned long pfn) { if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_disable(); else preempt_disable(); pagefault_disable(); return __kmap_local_pfn_prot(pfn, kmap_prot); } static inline void __kunmap_atomic(const void *addr) { kunmap_local_indexed(addr); pagefault_enable(); if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_enable(); else preempt_enable(); } unsigned long __nr_free_highpages(void); unsigned long __totalhigh_pages(void); static inline unsigned long nr_free_highpages(void) { return __nr_free_highpages(); } static inline unsigned long totalhigh_pages(void) { return __totalhigh_pages(); } static inline bool is_kmap_addr(const void *x) { unsigned long addr = (unsigned long)x; return (addr >= PKMAP_ADDR(0) && addr < PKMAP_ADDR(LAST_PKMAP)) || (addr >= __fix_to_virt(FIX_KMAP_END) && addr < __fix_to_virt(FIX_KMAP_BEGIN)); } #else /* CONFIG_HIGHMEM */ static inline struct page *kmap_to_page(void *addr) { return virt_to_page(addr); } static inline void *kmap(struct page *page) { might_sleep(); return page_address(page); } static inline void kunmap_high(struct page *page) { } static inline void kmap_flush_unused(void) { } static inline void kunmap(struct page *page) { #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(page_address(page)); #endif } static inline void *kmap_local_page(struct page *page) { return page_address(page); } static inline void *kmap_local_page_try_from_panic(struct page *page) { return page_address(page); } static inline void *kmap_local_folio(struct folio *folio, size_t offset) { return folio_address(folio) + offset; } static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot) { return kmap_local_page(page); } static inline void *kmap_local_pfn(unsigned long pfn) { return kmap_local_page(pfn_to_page(pfn)); } static inline void __kunmap_local(const void *addr) { #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(PTR_ALIGN_DOWN(addr, PAGE_SIZE)); #endif } static inline void *kmap_atomic(struct page *page) { if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_disable(); else preempt_disable(); pagefault_disable(); return page_address(page); } static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) { return kmap_atomic(page); } static inline void *kmap_atomic_pfn(unsigned long pfn) { return kmap_atomic(pfn_to_page(pfn)); } static inline void __kunmap_atomic(const void *addr) { #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(PTR_ALIGN_DOWN(addr, PAGE_SIZE)); #endif pagefault_enable(); if (IS_ENABLED(CONFIG_PREEMPT_RT)) migrate_enable(); else preempt_enable(); } static inline unsigned long nr_free_highpages(void) { return 0; } static inline unsigned long totalhigh_pages(void) { return 0; } static inline bool is_kmap_addr(const void *x) { return false; } #endif /* CONFIG_HIGHMEM */ /** * kunmap_atomic - Unmap the virtual address mapped by kmap_atomic() - deprecated! * @__addr: Virtual address to be unmapped * * Unmaps an address previously mapped by kmap_atomic() and re-enables * pagefaults. Depending on PREEMP_RT configuration, re-enables also * migration and preemption. Users should not count on these side effects. * * Mappings should be unmapped in the reverse order that they were mapped. * See kmap_local_page() for details on nesting. * * @__addr can be any address within the mapped page, so there is no need * to subtract any offset that has been added. In contrast to kunmap(), * this function takes the address returned from kmap_atomic(), not the * page passed to it. The compiler will warn you if you pass the page. */ #define kunmap_atomic(__addr) \ do { \ BUILD_BUG_ON(__same_type((__addr), struct page *)); \ __kunmap_atomic(__addr); \ } while (0) /** * kunmap_local - Unmap a page mapped via kmap_local_page(). * @__addr: An address within the page mapped * * @__addr can be any address within the mapped page. Commonly it is the * address return from kmap_local_page(), but it can also include offsets. * * Unmapping should be done in the reverse order of the mapping. See * kmap_local_page() for details. */ #define kunmap_local(__addr) \ do { \ BUILD_BUG_ON(__same_type((__addr), struct page *)); \ __kunmap_local(__addr); \ } while (0) #endif |
| 23 11 24 1 1 4 18 23 621 353 92 1 2 2 8 4 15 5 1 9 26 2 76 361 76 530 77 120 438 437 8 441 2 620 174 174 125 74 175 11 357 357 11 368 621 52 175 175 46 634 45 52 45 47 687 53 53 53 53 53 52 52 52 2 52 644 643 645 52 46 52 645 63 63 217 175 45 269 164 147 46 47 47 47 46 2 47 163 3 45 686 217 632 688 47 687 297 621 258 79 259 259 19 19 85 85 91 20 71 1 1 79 79 15 72 273 273 273 273 60 60 71 71 71 71 143 143 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 | // SPDX-License-Identifier: GPL-2.0+ /* * (C) Copyright Linus Torvalds 1999 * (C) Copyright Johannes Erdfelt 1999-2001 * (C) Copyright Andreas Gal 1999 * (C) Copyright Gregory P. Smith 1999 * (C) Copyright Deti Fliegl 1999 * (C) Copyright Randy Dunlap 2000 * (C) Copyright David Brownell 2000-2002 */ #include <linux/bcd.h> #include <linux/module.h> #include <linux/version.h> #include <linux/kernel.h> #include <linux/sched/task_stack.h> #include <linux/slab.h> #include <linux/completion.h> #include <linux/utsname.h> #include <linux/mm.h> #include <asm/io.h> #include <linux/device.h> #include <linux/dma-mapping.h> #include <linux/mutex.h> #include <asm/irq.h> #include <asm/byteorder.h> #include <linux/unaligned.h> #include <linux/platform_device.h> #include <linux/workqueue.h> #include <linux/pm_runtime.h> #include <linux/types.h> #include <linux/genalloc.h> #include <linux/io.h> #include <linux/kcov.h> #include <linux/phy/phy.h> #include <linux/usb.h> #include <linux/usb/hcd.h> #include <linux/usb/otg.h> #include "usb.h" #include "phy.h" /*-------------------------------------------------------------------------*/ /* * USB Host Controller Driver framework * * Plugs into usbcore (usb_bus) and lets HCDs share code, minimizing * HCD-specific behaviors/bugs. * * This does error checks, tracks devices and urbs, and delegates to a * "hc_driver" only for code (and data) that really needs to know about * hardware differences. That includes root hub registers, i/o queues, * and so on ... but as little else as possible. * * Shared code includes most of the "root hub" code (these are emulated, * though each HC's hardware works differently) and PCI glue, plus request * tracking overhead. The HCD code should only block on spinlocks or on * hardware handshaking; blocking on software events (such as other kernel * threads releasing resources, or completing actions) is all generic. * * Happens the USB 2.0 spec says this would be invisible inside the "USBD", * and includes mostly a "HCDI" (HCD Interface) along with some APIs used * only by the hub driver ... and that neither should be seen or used by * usb client device drivers. * * Contributors of ideas or unattributed patches include: David Brownell, * Roman Weissgaerber, Rory Bolt, Greg Kroah-Hartman, ... * * HISTORY: * 2002-02-21 Pull in most of the usb_bus support from usb.c; some * associated cleanup. "usb_hcd" still != "usb_bus". * 2001-12-12 Initial patch version for Linux 2.5.1 kernel. */ /*-------------------------------------------------------------------------*/ /* Keep track of which host controller drivers are loaded */ unsigned long usb_hcds_loaded; EXPORT_SYMBOL_GPL(usb_hcds_loaded); /* host controllers we manage */ DEFINE_IDR (usb_bus_idr); EXPORT_SYMBOL_GPL (usb_bus_idr); /* used when allocating bus numbers */ #define USB_MAXBUS 64 /* used when updating list of hcds */ DEFINE_MUTEX(usb_bus_idr_lock); /* exported only for usbfs */ EXPORT_SYMBOL_GPL (usb_bus_idr_lock); /* used for controlling access to virtual root hubs */ static DEFINE_SPINLOCK(hcd_root_hub_lock); /* used when updating an endpoint's URB list */ static DEFINE_SPINLOCK(hcd_urb_list_lock); /* used to protect against unlinking URBs after the device is gone */ static DEFINE_SPINLOCK(hcd_urb_unlink_lock); /* wait queue for synchronous unlinks */ DECLARE_WAIT_QUEUE_HEAD(usb_kill_urb_queue); /*-------------------------------------------------------------------------*/ /* * Sharable chunks of root hub code. */ /*-------------------------------------------------------------------------*/ #define KERNEL_REL bin2bcd(LINUX_VERSION_MAJOR) #define KERNEL_VER bin2bcd(LINUX_VERSION_PATCHLEVEL) /* usb 3.1 root hub device descriptor */ static const u8 usb31_rh_dev_descriptor[18] = { 0x12, /* __u8 bLength; */ USB_DT_DEVICE, /* __u8 bDescriptorType; Device */ 0x10, 0x03, /* __le16 bcdUSB; v3.1 */ 0x09, /* __u8 bDeviceClass; HUB_CLASSCODE */ 0x00, /* __u8 bDeviceSubClass; */ 0x03, /* __u8 bDeviceProtocol; USB 3 hub */ 0x09, /* __u8 bMaxPacketSize0; 2^9 = 512 Bytes */ 0x6b, 0x1d, /* __le16 idVendor; Linux Foundation 0x1d6b */ 0x03, 0x00, /* __le16 idProduct; device 0x0003 */ KERNEL_VER, KERNEL_REL, /* __le16 bcdDevice */ 0x03, /* __u8 iManufacturer; */ 0x02, /* __u8 iProduct; */ 0x01, /* __u8 iSerialNumber; */ 0x01 /* __u8 bNumConfigurations; */ }; /* usb 3.0 root hub device descriptor */ static const u8 usb3_rh_dev_descriptor[18] = { 0x12, /* __u8 bLength; */ USB_DT_DEVICE, /* __u8 bDescriptorType; Device */ 0x00, 0x03, /* __le16 bcdUSB; v3.0 */ 0x09, /* __u8 bDeviceClass; HUB_CLASSCODE */ 0x00, /* __u8 bDeviceSubClass; */ 0x03, /* __u8 bDeviceProtocol; USB 3.0 hub */ 0x09, /* __u8 bMaxPacketSize0; 2^9 = 512 Bytes */ 0x6b, 0x1d, /* __le16 idVendor; Linux Foundation 0x1d6b */ 0x03, 0x00, /* __le16 idProduct; device 0x0003 */ KERNEL_VER, KERNEL_REL, /* __le16 bcdDevice */ 0x03, /* __u8 iManufacturer; */ 0x02, /* __u8 iProduct; */ 0x01, /* __u8 iSerialNumber; */ 0x01 /* __u8 bNumConfigurations; */ }; /* usb 2.0 root hub device descriptor */ static const u8 usb2_rh_dev_descriptor[18] = { 0x12, /* __u8 bLength; */ USB_DT_DEVICE, /* __u8 bDescriptorType; Device */ 0x00, 0x02, /* __le16 bcdUSB; v2.0 */ 0x09, /* __u8 bDeviceClass; HUB_CLASSCODE */ 0x00, /* __u8 bDeviceSubClass; */ 0x00, /* __u8 bDeviceProtocol; [ usb 2.0 no TT ] */ 0x40, /* __u8 bMaxPacketSize0; 64 Bytes */ 0x6b, 0x1d, /* __le16 idVendor; Linux Foundation 0x1d6b */ 0x02, 0x00, /* __le16 idProduct; device 0x0002 */ KERNEL_VER, KERNEL_REL, /* __le16 bcdDevice */ 0x03, /* __u8 iManufacturer; */ 0x02, /* __u8 iProduct; */ 0x01, /* __u8 iSerialNumber; */ 0x01 /* __u8 bNumConfigurations; */ }; /* no usb 2.0 root hub "device qualifier" descriptor: one speed only */ /* usb 1.1 root hub device descriptor */ static const u8 usb11_rh_dev_descriptor[18] = { 0x12, /* __u8 bLength; */ USB_DT_DEVICE, /* __u8 bDescriptorType; Device */ 0x10, 0x01, /* __le16 bcdUSB; v1.1 */ 0x09, /* __u8 bDeviceClass; HUB_CLASSCODE */ 0x00, /* __u8 bDeviceSubClass; */ 0x00, /* __u8 bDeviceProtocol; [ low/full speeds only ] */ 0x40, /* __u8 bMaxPacketSize0; 64 Bytes */ 0x6b, 0x1d, /* __le16 idVendor; Linux Foundation 0x1d6b */ 0x01, 0x00, /* __le16 idProduct; device 0x0001 */ KERNEL_VER, KERNEL_REL, /* __le16 bcdDevice */ 0x03, /* __u8 iManufacturer; */ 0x02, /* __u8 iProduct; */ 0x01, /* __u8 iSerialNumber; */ 0x01 /* __u8 bNumConfigurations; */ }; /*-------------------------------------------------------------------------*/ /* Configuration descriptors for our root hubs */ static const u8 fs_rh_config_descriptor[] = { /* one configuration */ 0x09, /* __u8 bLength; */ USB_DT_CONFIG, /* __u8 bDescriptorType; Configuration */ 0x19, 0x00, /* __le16 wTotalLength; */ 0x01, /* __u8 bNumInterfaces; (1) */ 0x01, /* __u8 bConfigurationValue; */ 0x00, /* __u8 iConfiguration; */ 0xc0, /* __u8 bmAttributes; Bit 7: must be set, 6: Self-powered, 5: Remote wakeup, 4..0: resvd */ 0x00, /* __u8 MaxPower; */ /* USB 1.1: * USB 2.0, single TT organization (mandatory): * one interface, protocol 0 * * USB 2.0, multiple TT organization (optional): * two interfaces, protocols 1 (like single TT) * and 2 (multiple TT mode) ... config is * sometimes settable * NOT IMPLEMENTED */ /* one interface */ 0x09, /* __u8 if_bLength; */ USB_DT_INTERFACE, /* __u8 if_bDescriptorType; Interface */ 0x00, /* __u8 if_bInterfaceNumber; */ 0x00, /* __u8 if_bAlternateSetting; */ 0x01, /* __u8 if_bNumEndpoints; */ 0x09, /* __u8 if_bInterfaceClass; HUB_CLASSCODE */ 0x00, /* __u8 if_bInterfaceSubClass; */ 0x00, /* __u8 if_bInterfaceProtocol; [usb1.1 or single tt] */ 0x00, /* __u8 if_iInterface; */ /* one endpoint (status change endpoint) */ 0x07, /* __u8 ep_bLength; */ USB_DT_ENDPOINT, /* __u8 ep_bDescriptorType; Endpoint */ 0x81, /* __u8 ep_bEndpointAddress; IN Endpoint 1 */ 0x03, /* __u8 ep_bmAttributes; Interrupt */ 0x02, 0x00, /* __le16 ep_wMaxPacketSize; 1 + (MAX_ROOT_PORTS / 8) */ 0xff /* __u8 ep_bInterval; (255ms -- usb 2.0 spec) */ }; static const u8 hs_rh_config_descriptor[] = { /* one configuration */ 0x09, /* __u8 bLength; */ USB_DT_CONFIG, /* __u8 bDescriptorType; Configuration */ 0x19, 0x00, /* __le16 wTotalLength; */ 0x01, /* __u8 bNumInterfaces; (1) */ 0x01, /* __u8 bConfigurationValue; */ 0x00, /* __u8 iConfiguration; */ 0xc0, /* __u8 bmAttributes; Bit 7: must be set, 6: Self-powered, 5: Remote wakeup, 4..0: resvd */ 0x00, /* __u8 MaxPower; */ /* USB 1.1: * USB 2.0, single TT organization (mandatory): * one interface, protocol 0 * * USB 2.0, multiple TT organization (optional): * two interfaces, protocols 1 (like single TT) * and 2 (multiple TT mode) ... config is * sometimes settable * NOT IMPLEMENTED */ /* one interface */ 0x09, /* __u8 if_bLength; */ USB_DT_INTERFACE, /* __u8 if_bDescriptorType; Interface */ 0x00, /* __u8 if_bInterfaceNumber; */ 0x00, /* __u8 if_bAlternateSetting; */ 0x01, /* __u8 if_bNumEndpoints; */ 0x09, /* __u8 if_bInterfaceClass; HUB_CLASSCODE */ 0x00, /* __u8 if_bInterfaceSubClass; */ 0x00, /* __u8 if_bInterfaceProtocol; [usb1.1 or single tt] */ 0x00, /* __u8 if_iInterface; */ /* one endpoint (status change endpoint) */ 0x07, /* __u8 ep_bLength; */ USB_DT_ENDPOINT, /* __u8 ep_bDescriptorType; Endpoint */ 0x81, /* __u8 ep_bEndpointAddress; IN Endpoint 1 */ 0x03, /* __u8 ep_bmAttributes; Interrupt */ /* __le16 ep_wMaxPacketSize; 1 + (MAX_ROOT_PORTS / 8) * see hub.c:hub_configure() for details. */ (USB_MAXCHILDREN + 1 + 7) / 8, 0x00, 0x0c /* __u8 ep_bInterval; (256ms -- usb 2.0 spec) */ }; static const u8 ss_rh_config_descriptor[] = { /* one configuration */ 0x09, /* __u8 bLength; */ USB_DT_CONFIG, /* __u8 bDescriptorType; Configuration */ 0x1f, 0x00, /* __le16 wTotalLength; */ 0x01, /* __u8 bNumInterfaces; (1) */ 0x01, /* __u8 bConfigurationValue; */ 0x00, /* __u8 iConfiguration; */ 0xc0, /* __u8 bmAttributes; Bit 7: must be set, 6: Self-powered, 5: Remote wakeup, 4..0: resvd */ 0x00, /* __u8 MaxPower; */ /* one interface */ 0x09, /* __u8 if_bLength; */ USB_DT_INTERFACE, /* __u8 if_bDescriptorType; Interface */ 0x00, /* __u8 if_bInterfaceNumber; */ 0x00, /* __u8 if_bAlternateSetting; */ 0x01, /* __u8 if_bNumEndpoints; */ 0x09, /* __u8 if_bInterfaceClass; HUB_CLASSCODE */ 0x00, /* __u8 if_bInterfaceSubClass; */ 0x00, /* __u8 if_bInterfaceProtocol; */ 0x00, /* __u8 if_iInterface; */ /* one endpoint (status change endpoint) */ 0x07, /* __u8 ep_bLength; */ USB_DT_ENDPOINT, /* __u8 ep_bDescriptorType; Endpoint */ 0x81, /* __u8 ep_bEndpointAddress; IN Endpoint 1 */ 0x03, /* __u8 ep_bmAttributes; Interrupt */ /* __le16 ep_wMaxPacketSize; 1 + (MAX_ROOT_PORTS / 8) * see hub.c:hub_configure() for details. */ (USB_MAXCHILDREN + 1 + 7) / 8, 0x00, 0x0c, /* __u8 ep_bInterval; (256ms -- usb 2.0 spec) */ /* one SuperSpeed endpoint companion descriptor */ 0x06, /* __u8 ss_bLength */ USB_DT_SS_ENDPOINT_COMP, /* __u8 ss_bDescriptorType; SuperSpeed EP */ /* Companion */ 0x00, /* __u8 ss_bMaxBurst; allows 1 TX between ACKs */ 0x00, /* __u8 ss_bmAttributes; 1 packet per service interval */ 0x02, 0x00 /* __le16 ss_wBytesPerInterval; 15 bits for max 15 ports */ }; /* authorized_default behaviour: * -1 is authorized for all devices (leftover from wireless USB) * 0 is unauthorized for all devices * 1 is authorized for all devices * 2 is authorized for internal devices */ #define USB_AUTHORIZE_WIRED -1 #define USB_AUTHORIZE_NONE 0 #define USB_AUTHORIZE_ALL 1 #define USB_AUTHORIZE_INTERNAL 2 static int authorized_default = CONFIG_USB_DEFAULT_AUTHORIZATION_MODE; module_param(authorized_default, int, S_IRUGO|S_IWUSR); MODULE_PARM_DESC(authorized_default, "Default USB device authorization: 0 is not authorized, 1 is authorized (default), 2 is authorized for internal devices, -1 is authorized (same as 1)"); /*-------------------------------------------------------------------------*/ /** * ascii2desc() - Helper routine for producing UTF-16LE string descriptors * @s: Null-terminated ASCII (actually ISO-8859-1) string * @buf: Buffer for USB string descriptor (header + UTF-16LE) * @len: Length (in bytes; may be odd) of descriptor buffer. * * Return: The number of bytes filled in: 2 + 2*strlen(s) or @len, * whichever is less. * * Note: * USB String descriptors can contain at most 126 characters; input * strings longer than that are truncated. */ static unsigned ascii2desc(char const *s, u8 *buf, unsigned len) { unsigned n, t = 2 + 2*strlen(s); if (t > 254) t = 254; /* Longest possible UTF string descriptor */ if (len > t) len = t; t += USB_DT_STRING << 8; /* Now t is first 16 bits to store */ n = len; while (n--) { *buf++ = t; if (!n--) break; *buf++ = t >> 8; t = (unsigned char)*s++; } return len; } /** * rh_string() - provides string descriptors for root hub * @id: the string ID number (0: langids, 1: serial #, 2: product, 3: vendor) * @hcd: the host controller for this root hub * @data: buffer for output packet * @len: length of the provided buffer * * Produces either a manufacturer, product or serial number string for the * virtual root hub device. * * Return: The number of bytes filled in: the length of the descriptor or * of the provided buffer, whichever is less. */ static unsigned rh_string(int id, struct usb_hcd const *hcd, u8 *data, unsigned len) { char buf[160]; char const *s; static char const langids[4] = {4, USB_DT_STRING, 0x09, 0x04}; /* language ids */ switch (id) { case 0: /* Array of LANGID codes (0x0409 is MSFT-speak for "en-us") */ /* See http://www.usb.org/developers/docs/USB_LANGIDs.pdf */ if (len > 4) len = 4; memcpy(data, langids, len); return len; case 1: /* Serial number */ s = hcd->self.bus_name; break; case 2: /* Product name */ s = hcd->product_desc; break; case 3: /* Manufacturer */ snprintf (buf, sizeof buf, "%s %s %s", init_utsname()->sysname, init_utsname()->release, hcd->driver->description); s = buf; break; default: /* Can't happen; caller guarantees it */ return 0; } return ascii2desc(s, data, len); } /* Root hub control transfers execute synchronously */ static int rh_call_control (struct usb_hcd *hcd, struct urb *urb) { struct usb_ctrlrequest *cmd; u16 typeReq, wValue, wIndex, wLength; u8 *ubuf = urb->transfer_buffer; unsigned len = 0; int status; u8 patch_wakeup = 0; u8 patch_protocol = 0; u16 tbuf_size; u8 *tbuf = NULL; const u8 *bufp; might_sleep(); spin_lock_irq(&hcd_root_hub_lock); status = usb_hcd_link_urb_to_ep(hcd, urb); spin_unlock_irq(&hcd_root_hub_lock); if (status) return status; urb->hcpriv = hcd; /* Indicate it's queued */ cmd = (struct usb_ctrlrequest *) urb->setup_packet; typeReq = (cmd->bRequestType << 8) | cmd->bRequest; wValue = le16_to_cpu (cmd->wValue); wIndex = le16_to_cpu (cmd->wIndex); wLength = le16_to_cpu (cmd->wLength); if (wLength > urb->transfer_buffer_length) goto error; /* * tbuf should be at least as big as the * USB hub descriptor. */ tbuf_size = max_t(u16, sizeof(struct usb_hub_descriptor), wLength); tbuf = kzalloc(tbuf_size, GFP_KERNEL); if (!tbuf) { status = -ENOMEM; goto err_alloc; } bufp = tbuf; urb->actual_length = 0; switch (typeReq) { /* DEVICE REQUESTS */ /* The root hub's remote wakeup enable bit is implemented using * driver model wakeup flags. If this system supports wakeup * through USB, userspace may change the default "allow wakeup" * policy through sysfs or these calls. * * Most root hubs support wakeup from downstream devices, for * runtime power management (disabling USB clocks and reducing * VBUS power usage). However, not all of them do so; silicon, * board, and BIOS bugs here are not uncommon, so these can't * be treated quite like external hubs. * * Likewise, not all root hubs will pass wakeup events upstream, * to wake up the whole system. So don't assume root hub and * controller capabilities are identical. */ case DeviceRequest | USB_REQ_GET_STATUS: tbuf[0] = (device_may_wakeup(&hcd->self.root_hub->dev) << USB_DEVICE_REMOTE_WAKEUP) | (1 << USB_DEVICE_SELF_POWERED); tbuf[1] = 0; len = 2; break; case DeviceOutRequest | USB_REQ_CLEAR_FEATURE: if (wValue == USB_DEVICE_REMOTE_WAKEUP) device_set_wakeup_enable(&hcd->self.root_hub->dev, 0); else goto error; break; case DeviceOutRequest | USB_REQ_SET_FEATURE: if (device_can_wakeup(&hcd->self.root_hub->dev) && wValue == USB_DEVICE_REMOTE_WAKEUP) device_set_wakeup_enable(&hcd->self.root_hub->dev, 1); else goto error; break; case DeviceRequest | USB_REQ_GET_CONFIGURATION: tbuf[0] = 1; len = 1; fallthrough; case DeviceOutRequest | USB_REQ_SET_CONFIGURATION: break; case DeviceRequest | USB_REQ_GET_DESCRIPTOR: switch (wValue & 0xff00) { case USB_DT_DEVICE << 8: switch (hcd->speed) { case HCD_USB32: case HCD_USB31: bufp = usb31_rh_dev_descriptor; break; case HCD_USB3: bufp = usb3_rh_dev_descriptor; break; case HCD_USB2: bufp = usb2_rh_dev_descriptor; break; case HCD_USB11: bufp = usb11_rh_dev_descriptor; break; default: goto error; } len = 18; if (hcd->has_tt) patch_protocol = 1; break; case USB_DT_CONFIG << 8: switch (hcd->speed) { case HCD_USB32: case HCD_USB31: case HCD_USB3: bufp = ss_rh_config_descriptor; len = sizeof ss_rh_config_descriptor; break; case HCD_USB2: bufp = hs_rh_config_descriptor; len = sizeof hs_rh_config_descriptor; break; case HCD_USB11: bufp = fs_rh_config_descriptor; len = sizeof fs_rh_config_descriptor; break; default: goto error; } if (device_can_wakeup(&hcd->self.root_hub->dev)) patch_wakeup = 1; break; case USB_DT_STRING << 8: if ((wValue & 0xff) < 4) urb->actual_length = rh_string(wValue & 0xff, hcd, ubuf, wLength); else /* unsupported IDs --> "protocol stall" */ goto error; break; case USB_DT_BOS << 8: goto nongeneric; default: goto error; } break; case DeviceRequest | USB_REQ_GET_INTERFACE: tbuf[0] = 0; len = 1; fallthrough; case DeviceOutRequest | USB_REQ_SET_INTERFACE: break; case DeviceOutRequest | USB_REQ_SET_ADDRESS: /* wValue == urb->dev->devaddr */ dev_dbg (hcd->self.controller, "root hub device address %d\n", wValue); break; /* INTERFACE REQUESTS (no defined feature/status flags) */ /* ENDPOINT REQUESTS */ case EndpointRequest | USB_REQ_GET_STATUS: /* ENDPOINT_HALT flag */ tbuf[0] = 0; tbuf[1] = 0; len = 2; fallthrough; case EndpointOutRequest | USB_REQ_CLEAR_FEATURE: case EndpointOutRequest | USB_REQ_SET_FEATURE: dev_dbg (hcd->self.controller, "no endpoint features yet\n"); break; /* CLASS REQUESTS (and errors) */ default: nongeneric: /* non-generic request */ switch (typeReq) { case GetHubStatus: len = 4; break; case GetPortStatus: if (wValue == HUB_PORT_STATUS) len = 4; else /* other port status types return 8 bytes */ len = 8; break; case GetHubDescriptor: len = sizeof (struct usb_hub_descriptor); break; case DeviceRequest | USB_REQ_GET_DESCRIPTOR: /* len is returned by hub_control */ break; } status = hcd->driver->hub_control (hcd, typeReq, wValue, wIndex, tbuf, wLength); if (typeReq == GetHubDescriptor) usb_hub_adjust_deviceremovable(hcd->self.root_hub, (struct usb_hub_descriptor *)tbuf); break; error: /* "protocol stall" on error */ status = -EPIPE; } if (status < 0) { len = 0; if (status != -EPIPE) { dev_dbg (hcd->self.controller, "CTRL: TypeReq=0x%x val=0x%x " "idx=0x%x len=%d ==> %d\n", typeReq, wValue, wIndex, wLength, status); } } else if (status > 0) { /* hub_control may return the length of data copied. */ len = status; status = 0; } if (len) { if (urb->transfer_buffer_length < len) len = urb->transfer_buffer_length; urb->actual_length = len; /* always USB_DIR_IN, toward host */ memcpy (ubuf, bufp, len); /* report whether RH hardware supports remote wakeup */ if (patch_wakeup && len > offsetof (struct usb_config_descriptor, bmAttributes)) ((struct usb_config_descriptor *)ubuf)->bmAttributes |= USB_CONFIG_ATT_WAKEUP; /* report whether RH hardware has an integrated TT */ if (patch_protocol && len > offsetof(struct usb_device_descriptor, bDeviceProtocol)) ((struct usb_device_descriptor *) ubuf)-> bDeviceProtocol = USB_HUB_PR_HS_SINGLE_TT; } kfree(tbuf); err_alloc: /* any errors get returned through the urb completion */ spin_lock_irq(&hcd_root_hub_lock); usb_hcd_unlink_urb_from_ep(hcd, urb); usb_hcd_giveback_urb(hcd, urb, status); spin_unlock_irq(&hcd_root_hub_lock); return 0; } /*-------------------------------------------------------------------------*/ /* * Root Hub interrupt transfers are polled using a timer if the * driver requests it; otherwise the driver is responsible for * calling usb_hcd_poll_rh_status() when an event occurs. * * Completion handler may not sleep. See usb_hcd_giveback_urb() for details. */ void usb_hcd_poll_rh_status(struct usb_hcd *hcd) { struct urb *urb; int length; int status; unsigned long flags; char buffer[6]; /* Any root hubs with > 31 ports? */ if (unlikely(!hcd->rh_pollable)) return; if (!hcd->uses_new_polling && !hcd->status_urb) return; length = hcd->driver->hub_status_data(hcd, buffer); if (length > 0) { /* try to complete the status urb */ spin_lock_irqsave(&hcd_root_hub_lock, flags); urb = hcd->status_urb; if (urb) { clear_bit(HCD_FLAG_POLL_PENDING, &hcd->flags); hcd->status_urb = NULL; if (urb->transfer_buffer_length >= length) { status = 0; } else { status = -EOVERFLOW; length = urb->transfer_buffer_length; } urb->actual_length = length; memcpy(urb->transfer_buffer, buffer, length); usb_hcd_unlink_urb_from_ep(hcd, urb); usb_hcd_giveback_urb(hcd, urb, status); } else { length = 0; set_bit(HCD_FLAG_POLL_PENDING, &hcd->flags); } spin_unlock_irqrestore(&hcd_root_hub_lock, flags); } /* The USB 2.0 spec says 256 ms. This is close enough and won't * exceed that limit if HZ is 100. The math is more clunky than * maybe expected, this is to make sure that all timers for USB devices * fire at the same time to give the CPU a break in between */ if (hcd->uses_new_polling ? HCD_POLL_RH(hcd) : (length == 0 && hcd->status_urb != NULL)) mod_timer (&hcd->rh_timer, (jiffies/(HZ/4) + 1) * (HZ/4)); } EXPORT_SYMBOL_GPL(usb_hcd_poll_rh_status); /* timer callback */ static void rh_timer_func (struct timer_list *t) { struct usb_hcd *_hcd = timer_container_of(_hcd, t, rh_timer); usb_hcd_poll_rh_status(_hcd); } /*-------------------------------------------------------------------------*/ static int rh_queue_status (struct usb_hcd *hcd, struct urb *urb) { int retval; unsigned long flags; unsigned len = 1 + (urb->dev->maxchild / 8); spin_lock_irqsave (&hcd_root_hub_lock, flags); if (hcd->status_urb || urb->transfer_buffer_length < len) { dev_dbg (hcd->self.controller, "not queuing rh status urb\n"); retval = -EINVAL; goto done; } retval = usb_hcd_link_urb_to_ep(hcd, urb); if (retval) goto done; hcd->status_urb = urb; urb->hcpriv = hcd; /* indicate it's queued */ if (!hcd->uses_new_polling) mod_timer(&hcd->rh_timer, (jiffies/(HZ/4) + 1) * (HZ/4)); /* If a status change has already occurred, report it ASAP */ else if (HCD_POLL_PENDING(hcd)) mod_timer(&hcd->rh_timer, jiffies); retval = 0; done: spin_unlock_irqrestore (&hcd_root_hub_lock, flags); return retval; } static int rh_urb_enqueue (struct usb_hcd *hcd, struct urb *urb) { if (usb_endpoint_xfer_int(&urb->ep->desc)) return rh_queue_status (hcd, urb); if (usb_endpoint_xfer_control(&urb->ep->desc)) return rh_call_control (hcd, urb); return -EINVAL; } /*-------------------------------------------------------------------------*/ /* Unlinks of root-hub control URBs are legal, but they don't do anything * since these URBs always execute synchronously. */ static int usb_rh_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status) { unsigned long flags; int rc; spin_lock_irqsave(&hcd_root_hub_lock, flags); rc = usb_hcd_check_unlink_urb(hcd, urb, status); if (rc) goto done; if (usb_endpoint_num(&urb->ep->desc) == 0) { /* Control URB */ ; /* Do nothing */ } else { /* Status URB */ if (!hcd->uses_new_polling) timer_delete(&hcd->rh_timer); if (urb == hcd->status_urb) { hcd->status_urb = NULL; usb_hcd_unlink_urb_from_ep(hcd, urb); usb_hcd_giveback_urb(hcd, urb, status); } } done: spin_unlock_irqrestore(&hcd_root_hub_lock, flags); return rc; } /*-------------------------------------------------------------------------*/ /** * usb_bus_init - shared initialization code * @bus: the bus structure being initialized * * This code is used to initialize a usb_bus structure, memory for which is * separately managed. */ static void usb_bus_init (struct usb_bus *bus) { memset(&bus->devmap, 0, sizeof(bus->devmap)); bus->devnum_next = 1; bus->root_hub = NULL; bus->busnum = -1; bus->bandwidth_allocated = 0; bus->bandwidth_int_reqs = 0; bus->bandwidth_isoc_reqs = 0; mutex_init(&bus->devnum_next_mutex); } /*-------------------------------------------------------------------------*/ /** * usb_register_bus - registers the USB host controller with the usb core * @bus: pointer to the bus to register * * Context: task context, might sleep. * * Assigns a bus number, and links the controller into usbcore data * structures so that it can be seen by scanning the bus list. * * Return: 0 if successful. A negative error code otherwise. */ static int usb_register_bus(struct usb_bus *bus) { int result = -E2BIG; int busnum; mutex_lock(&usb_bus_idr_lock); busnum = idr_alloc(&usb_bus_idr, bus, 1, USB_MAXBUS, GFP_KERNEL); if (busnum < 0) { pr_err("%s: failed to get bus number\n", usbcore_name); goto error_find_busnum; } bus->busnum = busnum; mutex_unlock(&usb_bus_idr_lock); usb_notify_add_bus(bus); dev_info (bus->controller, "new USB bus registered, assigned bus " "number %d\n", bus->busnum); return 0; error_find_busnum: mutex_unlock(&usb_bus_idr_lock); return result; } /** * usb_deregister_bus - deregisters the USB host controller * @bus: pointer to the bus to deregister * * Context: task context, might sleep. * * Recycles the bus number, and unlinks the controller from usbcore data * structures so that it won't be seen by scanning the bus list. */ static void usb_deregister_bus (struct usb_bus *bus) { dev_info (bus->controller, "USB bus %d deregistered\n", bus->busnum); /* * NOTE: make sure that all the devices are removed by the * controller code, as well as having it call this when cleaning * itself up */ mutex_lock(&usb_bus_idr_lock); idr_remove(&usb_bus_idr, bus->busnum); mutex_unlock(&usb_bus_idr_lock); usb_notify_remove_bus(bus); } /** * register_root_hub - called by usb_add_hcd() to register a root hub * @hcd: host controller for this root hub * * This function registers the root hub with the USB subsystem. It sets up * the device properly in the device tree and then calls usb_new_device() * to register the usb device. It also assigns the root hub's USB address * (always 1). * * Return: 0 if successful. A negative error code otherwise. */ static int register_root_hub(struct usb_hcd *hcd) { struct device *parent_dev = hcd->self.controller; struct usb_device *usb_dev = hcd->self.root_hub; struct usb_device_descriptor *descr; const int devnum = 1; int retval; usb_dev->devnum = devnum; usb_dev->bus->devnum_next = devnum + 1; set_bit(devnum, usb_dev->bus->devmap); usb_set_device_state(usb_dev, USB_STATE_ADDRESS); mutex_lock(&usb_bus_idr_lock); usb_dev->ep0.desc.wMaxPacketSize = cpu_to_le16(64); descr = usb_get_device_descriptor(usb_dev); if (IS_ERR(descr)) { retval = PTR_ERR(descr); mutex_unlock(&usb_bus_idr_lock); dev_dbg (parent_dev, "can't read %s device descriptor %d\n", dev_name(&usb_dev->dev), retval); return retval; } usb_dev->descriptor = *descr; kfree(descr); if (le16_to_cpu(usb_dev->descriptor.bcdUSB) >= 0x0201) { retval = usb_get_bos_descriptor(usb_dev); if (!retval) { usb_dev->lpm_capable = usb_device_supports_lpm(usb_dev); } else if (usb_dev->speed >= USB_SPEED_SUPER) { mutex_unlock(&usb_bus_idr_lock); dev_dbg(parent_dev, "can't read %s bos descriptor %d\n", dev_name(&usb_dev->dev), retval); return retval; } } retval = usb_new_device (usb_dev); if (retval) { dev_err (parent_dev, "can't register root hub for %s, %d\n", dev_name(&usb_dev->dev), retval); } else { spin_lock_irq (&hcd_root_hub_lock); hcd->rh_registered = 1; spin_unlock_irq (&hcd_root_hub_lock); /* Did the HC die before the root hub was registered? */ if (HCD_DEAD(hcd)) usb_hc_died (hcd); /* This time clean up */ } mutex_unlock(&usb_bus_idr_lock); return retval; } /* * usb_hcd_start_port_resume - a root-hub port is sending a resume signal * @bus: the bus which the root hub belongs to * @portnum: the port which is being resumed * * HCDs should call this function when they know that a resume signal is * being sent to a root-hub port. The root hub will be prevented from * going into autosuspend until usb_hcd_end_port_resume() is called. * * The bus's private lock must be held by the caller. */ void usb_hcd_start_port_resume(struct usb_bus *bus, int portnum) { unsigned bit = 1 << portnum; if (!(bus->resuming_ports & bit)) { bus->resuming_ports |= bit; pm_runtime_get_noresume(&bus->root_hub->dev); } } EXPORT_SYMBOL_GPL(usb_hcd_start_port_resume); /* * usb_hcd_end_port_resume - a root-hub port has stopped sending a resume signal * @bus: the bus which the root hub belongs to * @portnum: the port which is being resumed * * HCDs should call this function when they know that a resume signal has * stopped being sent to a root-hub port. The root hub will be allowed to * autosuspend again. * * The bus's private lock must be held by the caller. */ void usb_hcd_end_port_resume(struct usb_bus *bus, int portnum) { unsigned bit = 1 << portnum; if (bus->resuming_ports & bit) { bus->resuming_ports &= ~bit; pm_runtime_put_noidle(&bus->root_hub->dev); } } EXPORT_SYMBOL_GPL(usb_hcd_end_port_resume); /*-------------------------------------------------------------------------*/ /** * usb_calc_bus_time - approximate periodic transaction time in nanoseconds * @speed: from dev->speed; USB_SPEED_{LOW,FULL,HIGH} * @is_input: true iff the transaction sends data to the host * @isoc: true for isochronous transactions, false for interrupt ones * @bytecount: how many bytes in the transaction. * * Return: Approximate bus time in nanoseconds for a periodic transaction. * * Note: * See USB 2.0 spec section 5.11.3; only periodic transfers need to be * scheduled in software, this function is only used for such scheduling. */ long usb_calc_bus_time (int speed, int is_input, int isoc, int bytecount) { unsigned long tmp; switch (speed) { case USB_SPEED_LOW: /* INTR only */ if (is_input) { tmp = (67667L * (31L + 10L * BitTime (bytecount))) / 1000L; return 64060L + (2 * BW_HUB_LS_SETUP) + BW_HOST_DELAY + tmp; } else { tmp = (66700L * (31L + 10L * BitTime (bytecount))) / 1000L; return 64107L + (2 * BW_HUB_LS_SETUP) + BW_HOST_DELAY + tmp; } case USB_SPEED_FULL: /* ISOC or INTR */ if (isoc) { tmp = (8354L * (31L + 10L * BitTime (bytecount))) / 1000L; return ((is_input) ? 7268L : 6265L) + BW_HOST_DELAY + tmp; } else { tmp = (8354L * (31L + 10L * BitTime (bytecount))) / 1000L; return 9107L + BW_HOST_DELAY + tmp; } case USB_SPEED_HIGH: /* ISOC or INTR */ /* FIXME adjust for input vs output */ if (isoc) tmp = HS_NSECS_ISO (bytecount); else tmp = HS_NSECS (bytecount); return tmp; default: pr_debug ("%s: bogus device speed!\n", usbcore_name); return -1; } } EXPORT_SYMBOL_GPL(usb_calc_bus_time); /*-------------------------------------------------------------------------*/ /* * Generic HC operations. */ /*-------------------------------------------------------------------------*/ /** * usb_hcd_link_urb_to_ep - add an URB to its endpoint queue * @hcd: host controller to which @urb was submitted * @urb: URB being submitted * * Host controller drivers should call this routine in their enqueue() * method. The HCD's private spinlock must be held and interrupts must * be disabled. The actions carried out here are required for URB * submission, as well as for endpoint shutdown and for usb_kill_urb. * * Return: 0 for no error, otherwise a negative error code (in which case * the enqueue() method must fail). If no error occurs but enqueue() fails * anyway, it must call usb_hcd_unlink_urb_from_ep() before releasing * the private spinlock and returning. */ int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb) { int rc = 0; spin_lock(&hcd_urb_list_lock); /* Check that the URB isn't being killed */ if (unlikely(atomic_read(&urb->reject))) { rc = -EPERM; goto done; } if (unlikely(!urb->ep->enabled)) { rc = -ENOENT; goto done; } if (unlikely(!urb->dev->can_submit)) { rc = -EHOSTUNREACH; goto done; } /* * Check the host controller's state and add the URB to the * endpoint's queue. */ if (HCD_RH_RUNNING(hcd)) { urb->unlinked = 0; list_add_tail(&urb->urb_list, &urb->ep->urb_list); } else { rc = -ESHUTDOWN; goto done; } done: spin_unlock(&hcd_urb_list_lock); return rc; } EXPORT_SYMBOL_GPL(usb_hcd_link_urb_to_ep); /** * usb_hcd_check_unlink_urb - check whether an URB may be unlinked * @hcd: host controller to which @urb was submitted * @urb: URB being checked for unlinkability * @status: error code to store in @urb if the unlink succeeds * * Host controller drivers should call this routine in their dequeue() * method. The HCD's private spinlock must be held and interrupts must * be disabled. The actions carried out here are required for making * sure than an unlink is valid. * * Return: 0 for no error, otherwise a negative error code (in which case * the dequeue() method must fail). The possible error codes are: * * -EIDRM: @urb was not submitted or has already completed. * The completion function may not have been called yet. * * -EBUSY: @urb has already been unlinked. */ int usb_hcd_check_unlink_urb(struct usb_hcd *hcd, struct urb *urb, int status) { struct list_head *tmp; /* insist the urb is still queued */ list_for_each(tmp, &urb->ep->urb_list) { if (tmp == &urb->urb_list) break; } if (tmp != &urb->urb_list) return -EIDRM; /* Any status except -EINPROGRESS means something already started to * unlink this URB from the hardware. So there's no more work to do. */ if (urb->unlinked) return -EBUSY; urb->unlinked = status; return 0; } EXPORT_SYMBOL_GPL(usb_hcd_check_unlink_urb); /** * usb_hcd_unlink_urb_from_ep - remove an URB from its endpoint queue * @hcd: host controller to which @urb was submitted * @urb: URB being unlinked * * Host controller drivers should call this routine before calling * usb_hcd_giveback_urb(). The HCD's private spinlock must be held and * interrupts must be disabled. The actions carried out here are required * for URB completion. */ void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb) { /* clear all state linking urb to this dev (and hcd) */ spin_lock(&hcd_urb_list_lock); list_del_init(&urb->urb_list); spin_unlock(&hcd_urb_list_lock); } EXPORT_SYMBOL_GPL(usb_hcd_unlink_urb_from_ep); /* * Some usb host controllers can only perform dma using a small SRAM area, * or have restrictions on addressable DRAM. * The usb core itself is however optimized for host controllers that can dma * using regular system memory - like pci devices doing bus mastering. * * To support host controllers with limited dma capabilities we provide dma * bounce buffers. This feature can be enabled by initializing * hcd->localmem_pool using usb_hcd_setup_local_mem(). * * The initialized hcd->localmem_pool then tells the usb code to allocate all * data for dma using the genalloc API. * * So, to summarize... * * - We need "local" memory, canonical example being * a small SRAM on a discrete controller being the * only memory that the controller can read ... * (a) "normal" kernel memory is no good, and * (b) there's not enough to share * * - So we use that, even though the primary requirement * is that the memory be "local" (hence addressable * by that device), not "coherent". * */ static int hcd_alloc_coherent(struct usb_bus *bus, gfp_t mem_flags, dma_addr_t *dma_handle, void **vaddr_handle, size_t size, enum dma_data_direction dir) { unsigned char *vaddr; if (*vaddr_handle == NULL) { WARN_ON_ONCE(1); return -EFAULT; } vaddr = hcd_buffer_alloc(bus, size + sizeof(unsigned long), mem_flags, dma_handle); if (!vaddr) return -ENOMEM; /* * Store the virtual address of the buffer at the end * of the allocated dma buffer. The size of the buffer * may be uneven so use unaligned functions instead * of just rounding up. It makes sense to optimize for * memory footprint over access speed since the amount * of memory available for dma may be limited. */ put_unaligned((unsigned long)*vaddr_handle, (unsigned long *)(vaddr + size)); if (dir == DMA_TO_DEVICE) memcpy(vaddr, *vaddr_handle, size); *vaddr_handle = vaddr; return 0; } static void hcd_free_coherent(struct usb_bus *bus, dma_addr_t *dma_handle, void **vaddr_handle, size_t size, enum dma_data_direction dir) { unsigned char *vaddr = *vaddr_handle; vaddr = (void *)get_unaligned((unsigned long *)(vaddr + size)); if (dir == DMA_FROM_DEVICE) memcpy(vaddr, *vaddr_handle, size); hcd_buffer_free(bus, size + sizeof(vaddr), *vaddr_handle, *dma_handle); *vaddr_handle = vaddr; *dma_handle = 0; } void usb_hcd_unmap_urb_setup_for_dma(struct usb_hcd *hcd, struct urb *urb) { if (IS_ENABLED(CONFIG_HAS_DMA) && (urb->transfer_flags & URB_SETUP_MAP_SINGLE)) dma_unmap_single(hcd->self.sysdev, urb->setup_dma, sizeof(struct usb_ctrlrequest), DMA_TO_DEVICE); else if (urb->transfer_flags & URB_SETUP_MAP_LOCAL) hcd_free_coherent(urb->dev->bus, &urb->setup_dma, (void **) &urb->setup_packet, sizeof(struct usb_ctrlrequest), DMA_TO_DEVICE); /* Make it safe to call this routine more than once */ urb->transfer_flags &= ~(URB_SETUP_MAP_SINGLE | URB_SETUP_MAP_LOCAL); } EXPORT_SYMBOL_GPL(usb_hcd_unmap_urb_setup_for_dma); static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb) { if (hcd->driver->unmap_urb_for_dma) hcd->driver->unmap_urb_for_dma(hcd, urb); else usb_hcd_unmap_urb_for_dma(hcd, urb); } void usb_hcd_unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb) { enum dma_data_direction dir; usb_hcd_unmap_urb_setup_for_dma(hcd, urb); dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE; if (IS_ENABLED(CONFIG_HAS_DMA) && (urb->transfer_flags & URB_DMA_MAP_SG)) { dma_unmap_sg(hcd->self.sysdev, urb->sg, urb->num_sgs, dir); } else if (IS_ENABLED(CONFIG_HAS_DMA) && (urb->transfer_flags & URB_DMA_MAP_PAGE)) { dma_unmap_page(hcd->self.sysdev, urb->transfer_dma, urb->transfer_buffer_length, dir); } else if (IS_ENABLED(CONFIG_HAS_DMA) && (urb->transfer_flags & URB_DMA_MAP_SINGLE)) { dma_unmap_single(hcd->self.sysdev, urb->transfer_dma, urb->transfer_buffer_length, dir); } else if (urb->transfer_flags & URB_MAP_LOCAL) { hcd_free_coherent(urb->dev->bus, &urb->transfer_dma, &urb->transfer_buffer, urb->transfer_buffer_length, dir); } else if ((urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) && urb->sgt) { dma_sync_sgtable_for_cpu(hcd->self.sysdev, urb->sgt, dir); if (dir == DMA_FROM_DEVICE) invalidate_kernel_vmap_range(urb->transfer_buffer, urb->transfer_buffer_length); } /* Make it safe to call this routine more than once */ urb->transfer_flags &= ~(URB_DMA_MAP_SG | URB_DMA_MAP_PAGE | URB_DMA_MAP_SINGLE | URB_MAP_LOCAL); } EXPORT_SYMBOL_GPL(usb_hcd_unmap_urb_for_dma); static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb, gfp_t mem_flags) { if (hcd->driver->map_urb_for_dma) return hcd->driver->map_urb_for_dma(hcd, urb, mem_flags); else return usb_hcd_map_urb_for_dma(hcd, urb, mem_flags); } int usb_hcd_map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb, gfp_t mem_flags) { enum dma_data_direction dir; int ret = 0; /* Map the URB's buffers for DMA access. * Lower level HCD code should use *_dma exclusively, * unless it uses pio or talks to another transport, * or uses the provided scatter gather list for bulk. */ if (usb_endpoint_xfer_control(&urb->ep->desc)) { if (hcd->self.uses_pio_for_control) return ret; if (hcd->localmem_pool) { ret = hcd_alloc_coherent( urb->dev->bus, mem_flags, &urb->setup_dma, (void **)&urb->setup_packet, sizeof(struct usb_ctrlrequest), DMA_TO_DEVICE); if (ret) return ret; urb->transfer_flags |= URB_SETUP_MAP_LOCAL; } else if (hcd_uses_dma(hcd)) { if (object_is_on_stack(urb->setup_packet)) { WARN_ONCE(1, "setup packet is on stack\n"); return -EAGAIN; } urb->setup_dma = dma_map_single( hcd->self.sysdev, urb->setup_packet, sizeof(struct usb_ctrlrequest), DMA_TO_DEVICE); if (dma_mapping_error(hcd->self.sysdev, urb->setup_dma)) return -EAGAIN; urb->transfer_flags |= URB_SETUP_MAP_SINGLE; } } dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE; if (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) { if (!urb->sgt) return 0; if (dir == DMA_TO_DEVICE) flush_kernel_vmap_range(urb->transfer_buffer, urb->transfer_buffer_length); dma_sync_sgtable_for_device(hcd->self.sysdev, urb->sgt, dir); } else if (urb->transfer_buffer_length != 0) { if (hcd->localmem_pool) { ret = hcd_alloc_coherent( urb->dev->bus, mem_flags, &urb->transfer_dma, &urb->transfer_buffer, urb->transfer_buffer_length, dir); if (ret == 0) urb->transfer_flags |= URB_MAP_LOCAL; } else if (hcd_uses_dma(hcd)) { if (urb->num_sgs) { int n; /* We don't support sg for isoc transfers ! */ if (usb_endpoint_xfer_isoc(&urb->ep->desc)) { WARN_ON(1); return -EINVAL; } n = dma_map_sg( hcd->self.sysdev, urb->sg, urb->num_sgs, dir); if (!n) ret = -EAGAIN; else urb->transfer_flags |= URB_DMA_MAP_SG; urb->num_mapped_sgs = n; if (n != urb->num_sgs) urb->transfer_flags |= URB_DMA_SG_COMBINED; } else if (urb->sg) { struct scatterlist *sg = urb->sg; urb->transfer_dma = dma_map_page( hcd->self.sysdev, sg_page(sg), sg->offset, urb->transfer_buffer_length, dir); if (dma_mapping_error(hcd->self.sysdev, urb->transfer_dma)) ret = -EAGAIN; else urb->transfer_flags |= URB_DMA_MAP_PAGE; } else if (object_is_on_stack(urb->transfer_buffer)) { WARN_ONCE(1, "transfer buffer is on stack\n"); ret = -EAGAIN; } else { urb->transfer_dma = dma_map_single( hcd->self.sysdev, urb->transfer_buffer, urb->transfer_buffer_length, dir); if (dma_mapping_error(hcd->self.sysdev, urb->transfer_dma)) ret = -EAGAIN; else urb->transfer_flags |= URB_DMA_MAP_SINGLE; } } if (ret && (urb->transfer_flags & (URB_SETUP_MAP_SINGLE | URB_SETUP_MAP_LOCAL))) usb_hcd_unmap_urb_for_dma(hcd, urb); } return ret; } EXPORT_SYMBOL_GPL(usb_hcd_map_urb_for_dma); /*-------------------------------------------------------------------------*/ /* may be called in any context with a valid urb->dev usecount * caller surrenders "ownership" of urb * expects usb_submit_urb() to have sanity checked and conditioned all * inputs in the urb */ int usb_hcd_submit_urb (struct urb *urb, gfp_t mem_flags) { int status; struct usb_hcd *hcd = bus_to_hcd(urb->dev->bus); /* increment urb's reference count as part of giving it to the HCD * (which will control it). HCD guarantees that it either returns * an error or calls giveback(), but not both. */ usb_get_urb(urb); atomic_inc(&urb->use_count); atomic_inc(&urb->dev->urbnum); usbmon_urb_submit(&hcd->self, urb); /* NOTE requirements on root-hub callers (usbfs and the hub * driver, for now): URBs' urb->transfer_buffer must be * valid and usb_buffer_{sync,unmap}() not be needed, since * they could clobber root hub response data. Also, control * URBs must be submitted in process context with interrupts * enabled. */ if (is_root_hub(urb->dev)) { status = rh_urb_enqueue(hcd, urb); } else { status = map_urb_for_dma(hcd, urb, mem_flags); if (likely(status == 0)) { status = hcd->driver->urb_enqueue(hcd, urb, mem_flags); if (unlikely(status)) unmap_urb_for_dma(hcd, urb); } } if (unlikely(status)) { usbmon_urb_submit_error(&hcd->self, urb, status); urb->hcpriv = NULL; INIT_LIST_HEAD(&urb->urb_list); atomic_dec(&urb->use_count); /* * Order the write of urb->use_count above before the read * of urb->reject below. Pairs with the memory barriers in * usb_kill_urb() and usb_poison_urb(). */ smp_mb__after_atomic(); atomic_dec(&urb->dev->urbnum); if (atomic_read(&urb->reject)) wake_up(&usb_kill_urb_queue); usb_put_urb(urb); } return status; } /*-------------------------------------------------------------------------*/ /* this makes the hcd giveback() the urb more quickly, by kicking it * off hardware queues (which may take a while) and returning it as * soon as practical. we've already set up the urb's return status, * but we can't know if the callback completed already. */ static int unlink1(struct usb_hcd *hcd, struct urb *urb, int status) { int value; if (is_root_hub(urb->dev)) value = usb_rh_urb_dequeue(hcd, urb, status); else { /* The only reason an HCD might fail this call is if * it has not yet fully queued the urb to begin with. * Such failures should be harmless. */ value = hcd->driver->urb_dequeue(hcd, urb, status); } return value; } /* * called in any context * * caller guarantees urb won't be recycled till both unlink() * and the urb's completion function return */ int usb_hcd_unlink_urb (struct urb *urb, int status) { struct usb_hcd *hcd; struct usb_device *udev = urb->dev; int retval = -EIDRM; unsigned long flags; /* Prevent the device and bus from going away while * the unlink is carried out. If they are already gone * then urb->use_count must be 0, since disconnected * devices can't have any active URBs. */ spin_lock_irqsave(&hcd_urb_unlink_lock, flags); if (atomic_read(&urb->use_count) > 0) { retval = 0; usb_get_dev(udev); } spin_unlock_irqrestore(&hcd_urb_unlink_lock, flags); if (retval == 0) { hcd = bus_to_hcd(urb->dev->bus); retval = unlink1(hcd, urb, status); if (retval == 0) retval = -EINPROGRESS; else if (retval != -EIDRM && retval != -EBUSY) dev_dbg(&udev->dev, "hcd_unlink_urb %p fail %d\n", urb, retval); usb_put_dev(udev); } return retval; } /*-------------------------------------------------------------------------*/ static void __usb_hcd_giveback_urb(struct urb *urb) { struct usb_hcd *hcd = bus_to_hcd(urb->dev->bus); struct usb_anchor *anchor = urb->anchor; int status = urb->unlinked; urb->hcpriv = NULL; if (unlikely((urb->transfer_flags & URB_SHORT_NOT_OK) && urb->actual_length < urb->transfer_buffer_length && !status)) status = -EREMOTEIO; unmap_urb_for_dma(hcd, urb); usbmon_urb_complete(&hcd->self, urb, status); usb_anchor_suspend_wakeups(anchor); usb_unanchor_urb(urb); if (likely(status == 0)) usb_led_activity(USB_LED_EVENT_HOST); /* pass ownership to the completion handler */ urb->status = status; /* * This function can be called in task context inside another remote * coverage collection section, but kcov doesn't support that kind of * recursion yet. Only collect coverage in softirq context for now. */ kcov_remote_start_usb_softirq((u64)urb->dev->bus->busnum); urb->complete(urb); kcov_remote_stop_softirq(); usb_anchor_resume_wakeups(anchor); atomic_dec(&urb->use_count); /* * Order the write of urb->use_count above before the read * of urb->reject below. Pairs with the memory barriers in * usb_kill_urb() and usb_poison_urb(). */ smp_mb__after_atomic(); if (unlikely(atomic_read(&urb->reject))) wake_up(&usb_kill_urb_queue); usb_put_urb(urb); } static void usb_giveback_urb_bh(struct work_struct *work) { struct giveback_urb_bh *bh = container_of(work, struct giveback_urb_bh, bh); struct list_head local_list; spin_lock_irq(&bh->lock); bh->running = true; list_replace_init(&bh->head, &local_list); spin_unlock_irq(&bh->lock); while (!list_empty(&local_list)) { struct urb *urb; urb = list_entry(local_list.next, struct urb, urb_list); list_del_init(&urb->urb_list); bh->completing_ep = urb->ep; __usb_hcd_giveback_urb(urb); bh->completing_ep = NULL; } /* * giveback new URBs next time to prevent this function * from not exiting for a long time. */ spin_lock_irq(&bh->lock); if (!list_empty(&bh->head)) { if (bh->high_prio) queue_work(system_bh_highpri_wq, &bh->bh); else queue_work(system_bh_wq, &bh->bh); } bh->running = false; spin_unlock_irq(&bh->lock); } /** * usb_hcd_giveback_urb - return URB from HCD to device driver * @hcd: host controller returning the URB * @urb: urb being returned to the USB device driver. * @status: completion status code for the URB. * * Context: atomic. The completion callback is invoked either in a work queue * (BH) context or in the caller's context, depending on whether the HCD_BH * flag is set in the @hcd structure, except that URBs submitted to the * root hub always complete in BH context. * * This hands the URB from HCD to its USB device driver, using its * completion function. The HCD has freed all per-urb resources * (and is done using urb->hcpriv). It also released all HCD locks; * the device driver won't cause problems if it frees, modifies, * or resubmits this URB. * * If @urb was unlinked, the value of @status will be overridden by * @urb->unlinked. Erroneous short transfers are detected in case * the HCD hasn't checked for them. */ void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status) { struct giveback_urb_bh *bh; bool running; /* pass status to BH via unlinked */ if (likely(!urb->unlinked)) urb->unlinked = status; if (!hcd_giveback_urb_in_bh(hcd) && !is_root_hub(urb->dev)) { __usb_hcd_giveback_urb(urb); return; } if (usb_pipeisoc(urb->pipe) || usb_pipeint(urb->pipe)) bh = &hcd->high_prio_bh; else bh = &hcd->low_prio_bh; spin_lock(&bh->lock); list_add_tail(&urb->urb_list, &bh->head); running = bh->running; spin_unlock(&bh->lock); if (running) ; else if (bh->high_prio) queue_work(system_bh_highpri_wq, &bh->bh); else queue_work(system_bh_wq, &bh->bh); } EXPORT_SYMBOL_GPL(usb_hcd_giveback_urb); /*-------------------------------------------------------------------------*/ /* Cancel all URBs pending on this endpoint and wait for the endpoint's * queue to drain completely. The caller must first insure that no more * URBs can be submitted for this endpoint. */ void usb_hcd_flush_endpoint(struct usb_device *udev, struct usb_host_endpoint *ep) { struct usb_hcd *hcd; struct urb *urb; if (!ep) return; might_sleep(); hcd = bus_to_hcd(udev->bus); /* No more submits can occur */ spin_lock_irq(&hcd_urb_list_lock); rescan: list_for_each_entry_reverse(urb, &ep->urb_list, urb_list) { int is_in; if (urb->unlinked) continue; usb_get_urb (urb); is_in = usb_urb_dir_in(urb); spin_unlock(&hcd_urb_list_lock); /* kick hcd */ unlink1(hcd, urb, -ESHUTDOWN); dev_dbg (hcd->self.controller, "shutdown urb %p ep%d%s-%s\n", urb, usb_endpoint_num(&ep->desc), is_in ? "in" : "out", usb_ep_type_string(usb_endpoint_type(&ep->desc))); usb_put_urb (urb); /* list contents may have changed */ spin_lock(&hcd_urb_list_lock); goto rescan; } spin_unlock_irq(&hcd_urb_list_lock); /* Wait until the endpoint queue is completely empty */ while (!list_empty (&ep->urb_list)) { spin_lock_irq(&hcd_urb_list_lock); /* The list may have changed while we acquired the spinlock */ urb = NULL; if (!list_empty (&ep->urb_list)) { urb = list_entry (ep->urb_list.prev, struct urb, urb_list); usb_get_urb (urb); } spin_unlock_irq(&hcd_urb_list_lock); if (urb) { usb_kill_urb (urb); usb_put_urb (urb); } } } /** * usb_hcd_alloc_bandwidth - check whether a new bandwidth setting exceeds * the bus bandwidth * @udev: target &usb_device * @new_config: new configuration to install * @cur_alt: the current alternate interface setting * @new_alt: alternate interface setting that is being installed * * To change configurations, pass in the new configuration in new_config, * and pass NULL for cur_alt and new_alt. * * To reset a device's configuration (put the device in the ADDRESSED state), * pass in NULL for new_config, cur_alt, and new_alt. * * To change alternate interface settings, pass in NULL for new_config, * pass in the current alternate interface setting in cur_alt, * and pass in the new alternate interface setting in new_alt. * * Return: An error if the requested bandwidth change exceeds the * bus bandwidth or host controller internal resources. */ int usb_hcd_alloc_bandwidth(struct usb_device *udev, struct usb_host_config *new_config, struct usb_host_interface *cur_alt, struct usb_host_interface *new_alt) { int num_intfs, i, j; struct usb_host_interface *alt = NULL; int ret = 0; struct usb_hcd *hcd; struct usb_host_endpoint *ep; hcd = bus_to_hcd(udev->bus); if (!hcd->driver->check_bandwidth) return 0; /* Configuration is being removed - set configuration 0 */ if (!new_config && !cur_alt) { for (i = 1; i < 16; ++i) { ep = udev->ep_out[i]; if (ep) hcd->driver->drop_endpoint(hcd, udev, ep); ep = udev->ep_in[i]; if (ep) hcd->driver->drop_endpoint(hcd, udev, ep); } hcd->driver->check_bandwidth(hcd, udev); return 0; } /* Check if the HCD says there's enough bandwidth. Enable all endpoints * each interface's alt setting 0 and ask the HCD to check the bandwidth * of the bus. There will always be bandwidth for endpoint 0, so it's * ok to exclude it. */ if (new_config) { num_intfs = new_config->desc.bNumInterfaces; /* Remove endpoints (except endpoint 0, which is always on the * schedule) from the old config from the schedule */ for (i = 1; i < 16; ++i) { ep = udev->ep_out[i]; if (ep) { ret = hcd->driver->drop_endpoint(hcd, udev, ep); if (ret < 0) goto reset; } ep = udev->ep_in[i]; if (ep) { ret = hcd->driver->drop_endpoint(hcd, udev, ep); if (ret < 0) goto reset; } } for (i = 0; i < num_intfs; ++i) { struct usb_host_interface *first_alt; int iface_num; first_alt = &new_config->intf_cache[i]->altsetting[0]; iface_num = first_alt->desc.bInterfaceNumber; /* Set up endpoints for alternate interface setting 0 */ alt = usb_find_alt_setting(new_config, iface_num, 0); if (!alt) /* No alt setting 0? Pick the first setting. */ alt = first_alt; for (j = 0; j < alt->desc.bNumEndpoints; j++) { ret = hcd->driver->add_endpoint(hcd, udev, &alt->endpoint[j]); if (ret < 0) goto reset; } } } if (cur_alt && new_alt) { struct usb_interface *iface = usb_ifnum_to_if(udev, cur_alt->desc.bInterfaceNumber); if (!iface) return -EINVAL; if (iface->resetting_device) { /* * The USB core just reset the device, so the xHCI host * and the device will think alt setting 0 is installed. * However, the USB core will pass in the alternate * setting installed before the reset as cur_alt. Dig * out the alternate setting 0 structure, or the first * alternate setting if a broken device doesn't have alt * setting 0. */ cur_alt = usb_altnum_to_altsetting(iface, 0); if (!cur_alt) cur_alt = &iface->altsetting[0]; } /* Drop all the endpoints in the current alt setting */ for (i = 0; i < cur_alt->desc.bNumEndpoints; i++) { ret = hcd->driver->drop_endpoint(hcd, udev, &cur_alt->endpoint[i]); if (ret < 0) goto reset; } /* Add all the endpoints in the new alt setting */ for (i = 0; i < new_alt->desc.bNumEndpoints; i++) { ret = hcd->driver->add_endpoint(hcd, udev, &new_alt->endpoint[i]); if (ret < 0) goto reset; } } ret = hcd->driver->check_bandwidth(hcd, udev); reset: if (ret < 0) hcd->driver->reset_bandwidth(hcd, udev); return ret; } /* Disables the endpoint: synchronizes with the hcd to make sure all * endpoint state is gone from hardware. usb_hcd_flush_endpoint() must * have been called previously. Use for set_configuration, set_interface, * driver removal, physical disconnect. * * example: a qh stored in ep->hcpriv, holding state related to endpoint * type, maxpacket size, toggle, halt status, and scheduling. */ void usb_hcd_disable_endpoint(struct usb_device *udev, struct usb_host_endpoint *ep) { struct usb_hcd *hcd; might_sleep(); hcd = bus_to_hcd(udev->bus); if (hcd->driver->endpoint_disable) hcd->driver->endpoint_disable(hcd, ep); } /** * usb_hcd_reset_endpoint - reset host endpoint state * @udev: USB device. * @ep: the endpoint to reset. * * Resets any host endpoint state such as the toggle bit, sequence * number and current window. */ void usb_hcd_reset_endpoint(struct usb_device *udev, struct usb_host_endpoint *ep) { struct usb_hcd *hcd = bus_to_hcd(udev->bus); if (hcd->driver->endpoint_reset) hcd->driver->endpoint_reset(hcd, ep); else { int epnum = usb_endpoint_num(&ep->desc); int is_out = usb_endpoint_dir_out(&ep->desc); int is_control = usb_endpoint_xfer_control(&ep->desc); usb_settoggle(udev, epnum, is_out, 0); if (is_control) usb_settoggle(udev, epnum, !is_out, 0); } } /** * usb_alloc_streams - allocate bulk endpoint stream IDs. * @interface: alternate setting that includes all endpoints. * @eps: array of endpoints that need streams. * @num_eps: number of endpoints in the array. * @num_streams: number of streams to allocate. * @mem_flags: flags hcd should use to allocate memory. * * Sets up a group of bulk endpoints to have @num_streams stream IDs available. * Drivers may queue multiple transfers to different stream IDs, which may * complete in a different order than they were queued. * * Return: On success, the number of allocated streams. On failure, a negative * error code. */ int usb_alloc_streams(struct usb_interface *interface, struct usb_host_endpoint **eps, unsigned int num_eps, unsigned int num_streams, gfp_t mem_flags) { struct usb_hcd *hcd; struct usb_device *dev; int i, ret; dev = interface_to_usbdev(interface); hcd = bus_to_hcd(dev->bus); if (!hcd->driver->alloc_streams || !hcd->driver->free_streams) return -EINVAL; if (dev->speed < USB_SPEED_SUPER) return -EINVAL; if (dev->state < USB_STATE_CONFIGURED) return -ENODEV; for (i = 0; i < num_eps; i++) { /* Streams only apply to bulk endpoints. */ if (!usb_endpoint_xfer_bulk(&eps[i]->desc)) return -EINVAL; /* Re-alloc is not allowed */ if (eps[i]->streams) return -EINVAL; } ret = hcd->driver->alloc_streams(hcd, dev, eps, num_eps, num_streams, mem_flags); if (ret < 0) return ret; for (i = 0; i < num_eps; i++) eps[i]->streams = ret; return ret; } EXPORT_SYMBOL_GPL(usb_alloc_streams); /** * usb_free_streams - free bulk endpoint stream IDs. * @interface: alternate setting that includes all endpoints. * @eps: array of endpoints to remove streams from. * @num_eps: number of endpoints in the array. * @mem_flags: flags hcd should use to allocate memory. * * Reverts a group of bulk endpoints back to not using stream IDs. * Can fail if we are given bad arguments, or HCD is broken. * * Return: 0 on success. On failure, a negative error code. */ int usb_free_streams(struct usb_interface *interface, struct usb_host_endpoint **eps, unsigned int num_eps, gfp_t mem_flags) { struct usb_hcd *hcd; struct usb_device *dev; int i, ret; dev = interface_to_usbdev(interface); hcd = bus_to_hcd(dev->bus); if (dev->speed < USB_SPEED_SUPER) return -EINVAL; /* Double-free is not allowed */ for (i = 0; i < num_eps; i++) if (!eps[i] || !eps[i]->streams) return -EINVAL; ret = hcd->driver->free_streams(hcd, dev, eps, num_eps, mem_flags); if (ret < 0) return ret; for (i = 0; i < num_eps; i++) eps[i]->streams = 0; return ret; } EXPORT_SYMBOL_GPL(usb_free_streams); /* Protect against drivers that try to unlink URBs after the device * is gone, by waiting until all unlinks for @udev are finished. * Since we don't currently track URBs by device, simply wait until * nothing is running in the locked region of usb_hcd_unlink_urb(). */ void usb_hcd_synchronize_unlinks(struct usb_device *udev) { spin_lock_irq(&hcd_urb_unlink_lock); spin_unlock_irq(&hcd_urb_unlink_lock); } /*-------------------------------------------------------------------------*/ /* called in any context */ int usb_hcd_get_frame_number (struct usb_device *udev) { struct usb_hcd *hcd = bus_to_hcd(udev->bus); if (!HCD_RH_RUNNING(hcd)) return -ESHUTDOWN; return hcd->driver->get_frame_number (hcd); } /*-------------------------------------------------------------------------*/ #ifdef CONFIG_USB_HCD_TEST_MODE static void usb_ehset_completion(struct urb *urb) { struct completion *done = urb->context; complete(done); } /* * Allocate and initialize a control URB. This request will be used by the * EHSET SINGLE_STEP_SET_FEATURE test in which the DATA and STATUS stages * of the GetDescriptor request are sent 15 seconds after the SETUP stage. * Return NULL if failed. */ static struct urb *request_single_step_set_feature_urb( struct usb_device *udev, void *dr, void *buf, struct completion *done) { struct urb *urb; struct usb_hcd *hcd = bus_to_hcd(udev->bus); urb = usb_alloc_urb(0, GFP_KERNEL); if (!urb) return NULL; urb->pipe = usb_rcvctrlpipe(udev, 0); urb->ep = &udev->ep0; urb->dev = udev; urb->setup_packet = (void *)dr; urb->transfer_buffer = buf; urb->transfer_buffer_length = USB_DT_DEVICE_SIZE; urb->complete = usb_ehset_completion; urb->status = -EINPROGRESS; urb->actual_length = 0; urb->transfer_flags = URB_DIR_IN | URB_NO_TRANSFER_DMA_MAP; usb_get_urb(urb); atomic_inc(&urb->use_count); atomic_inc(&urb->dev->urbnum); if (map_urb_for_dma(hcd, urb, GFP_KERNEL)) { usb_put_urb(urb); usb_free_urb(urb); return NULL; } urb->context = done; return urb; } int ehset_single_step_set_feature(struct usb_hcd *hcd, int port) { int retval = -ENOMEM; struct usb_ctrlrequest *dr; struct urb *urb; struct usb_device *udev; struct usb_device_descriptor *buf; DECLARE_COMPLETION_ONSTACK(done); /* Obtain udev of the rhub's child port */ udev = usb_hub_find_child(hcd->self.root_hub, port); if (!udev) { dev_err(hcd->self.controller, "No device attached to the RootHub\n"); return -ENODEV; } buf = kmalloc(USB_DT_DEVICE_SIZE, GFP_KERNEL); if (!buf) return -ENOMEM; dr = kmalloc(sizeof(struct usb_ctrlrequest), GFP_KERNEL); if (!dr) { kfree(buf); return -ENOMEM; } /* Fill Setup packet for GetDescriptor */ dr->bRequestType = USB_DIR_IN; dr->bRequest = USB_REQ_GET_DESCRIPTOR; dr->wValue = cpu_to_le16(USB_DT_DEVICE << 8); dr->wIndex = 0; dr->wLength = cpu_to_le16(USB_DT_DEVICE_SIZE); urb = request_single_step_set_feature_urb(udev, dr, buf, &done); if (!urb) goto cleanup; /* Submit just the SETUP stage */ retval = hcd->driver->submit_single_step_set_feature(hcd, urb, 1); if (retval) goto out1; if (!wait_for_completion_timeout(&done, msecs_to_jiffies(2000))) { usb_kill_urb(urb); retval = -ETIMEDOUT; dev_err(hcd->self.controller, "%s SETUP stage timed out on ep0\n", __func__); goto out1; } msleep(15 * 1000); /* Complete remaining DATA and STATUS stages using the same URB */ urb->status = -EINPROGRESS; urb->transfer_flags &= ~URB_NO_TRANSFER_DMA_MAP; usb_get_urb(urb); atomic_inc(&urb->use_count); atomic_inc(&urb->dev->urbnum); if (map_urb_for_dma(hcd, urb, GFP_KERNEL)) { usb_put_urb(urb); goto out1; } retval = hcd->driver->submit_single_step_set_feature(hcd, urb, 0); if (!retval && !wait_for_completion_timeout(&done, msecs_to_jiffies(2000))) { usb_kill_urb(urb); retval = -ETIMEDOUT; dev_err(hcd->self.controller, "%s IN stage timed out on ep0\n", __func__); } out1: usb_free_urb(urb); cleanup: kfree(dr); kfree(buf); return retval; } EXPORT_SYMBOL_GPL(ehset_single_step_set_feature); #endif /* CONFIG_USB_HCD_TEST_MODE */ /*-------------------------------------------------------------------------*/ #ifdef CONFIG_PM int hcd_bus_suspend(struct usb_device *rhdev, pm_message_t msg) { struct usb_hcd *hcd = bus_to_hcd(rhdev->bus); int status; int old_state = hcd->state; dev_dbg(&rhdev->dev, "bus %ssuspend, wakeup %d\n", (PMSG_IS_AUTO(msg) ? "auto-" : ""), rhdev->do_remote_wakeup); if (HCD_DEAD(hcd)) { dev_dbg(&rhdev->dev, "skipped %s of dead bus\n", "suspend"); return 0; } if (!hcd->driver->bus_suspend) { status = -ENOENT; } else { clear_bit(HCD_FLAG_RH_RUNNING, &hcd->flags); hcd->state = HC_STATE_QUIESCING; status = hcd->driver->bus_suspend(hcd); } if (status == 0) { usb_set_device_state(rhdev, USB_STATE_SUSPENDED); hcd->state = HC_STATE_SUSPENDED; if (!PMSG_IS_AUTO(msg)) usb_phy_roothub_suspend(hcd->self.sysdev, hcd->phy_roothub); /* Did we race with a root-hub wakeup event? */ if (rhdev->do_remote_wakeup) { char buffer[6]; status = hcd->driver->hub_status_data(hcd, buffer); if (status != 0) { dev_dbg(&rhdev->dev, "suspend raced with wakeup event\n"); hcd_bus_resume(rhdev, PMSG_AUTO_RESUME); status = -EBUSY; } } } else { spin_lock_irq(&hcd_root_hub_lock); if (!HCD_DEAD(hcd)) { set_bit(HCD_FLAG_RH_RUNNING, &hcd->flags); hcd->state = old_state; } spin_unlock_irq(&hcd_root_hub_lock); dev_dbg(&rhdev->dev, "bus %s fail, err %d\n", "suspend", status); } return status; } int hcd_bus_resume(struct usb_device *rhdev, pm_message_t msg) { struct usb_hcd *hcd = bus_to_hcd(rhdev->bus); int status; int old_state = hcd->state; dev_dbg(&rhdev->dev, "usb %sresume\n", (PMSG_IS_AUTO(msg) ? "auto-" : "")); if (HCD_DEAD(hcd)) { dev_dbg(&rhdev->dev, "skipped %s of dead bus\n", "resume"); return 0; } if (!PMSG_IS_AUTO(msg)) { status = usb_phy_roothub_resume(hcd->self.sysdev, hcd->phy_roothub); if (status) return status; } if (!hcd->driver->bus_resume) return -ENOENT; if (HCD_RH_RUNNING(hcd)) return 0; hcd->state = HC_STATE_RESUMING; status = hcd->driver->bus_resume(hcd); clear_bit(HCD_FLAG_WAKEUP_PENDING, &hcd->flags); if (status == 0) status = usb_phy_roothub_calibrate(hcd->phy_roothub); if (status == 0) { struct usb_device *udev; int port1; spin_lock_irq(&hcd_root_hub_lock); if (!HCD_DEAD(hcd)) { usb_set_device_state(rhdev, rhdev->actconfig ? USB_STATE_CONFIGURED : USB_STATE_ADDRESS); set_bit(HCD_FLAG_RH_RUNNING, &hcd->flags); hcd->state = HC_STATE_RUNNING; } spin_unlock_irq(&hcd_root_hub_lock); /* * Check whether any of the enabled ports on the root hub are * unsuspended. If they are then a TRSMRCY delay is needed * (this is what the USB-2 spec calls a "global resume"). * Otherwise we can skip the delay. */ usb_hub_for_each_child(rhdev, port1, udev) { if (udev->state != USB_STATE_NOTATTACHED && !udev->port_is_suspended) { usleep_range(10000, 11000); /* TRSMRCY */ break; } } } else { hcd->state = old_state; usb_phy_roothub_suspend(hcd->self.sysdev, hcd->phy_roothub); dev_dbg(&rhdev->dev, "bus %s fail, err %d\n", "resume", status); if (status != -ESHUTDOWN) usb_hc_died(hcd); } return status; } /* Workqueue routine for root-hub remote wakeup */ static void hcd_resume_work(struct work_struct *work) { struct usb_hcd *hcd = container_of(work, struct usb_hcd, wakeup_work); struct usb_device *udev = hcd->self.root_hub; usb_remote_wakeup(udev); } /** * usb_hcd_resume_root_hub - called by HCD to resume its root hub * @hcd: host controller for this root hub * * The USB host controller calls this function when its root hub is * suspended (with the remote wakeup feature enabled) and a remote * wakeup request is received. The routine submits a workqueue request * to resume the root hub (that is, manage its downstream ports again). */ void usb_hcd_resume_root_hub (struct usb_hcd *hcd) { unsigned long flags; spin_lock_irqsave (&hcd_root_hub_lock, flags); if (hcd->rh_registered) { pm_wakeup_event(&hcd->self.root_hub->dev, 0); set_bit(HCD_FLAG_WAKEUP_PENDING, &hcd->flags); queue_work(pm_wq, &hcd->wakeup_work); } spin_unlock_irqrestore (&hcd_root_hub_lock, flags); } EXPORT_SYMBOL_GPL(usb_hcd_resume_root_hub); #endif /* CONFIG_PM */ /*-------------------------------------------------------------------------*/ #ifdef CONFIG_USB_OTG /** * usb_bus_start_enum - start immediate enumeration (for OTG) * @bus: the bus (must use hcd framework) * @port_num: 1-based number of port; usually bus->otg_port * Context: atomic * * Starts enumeration, with an immediate reset followed later by * hub_wq identifying and possibly configuring the device. * This is needed by OTG controller drivers, where it helps meet * HNP protocol timing requirements for starting a port reset. * * Return: 0 if successful. */ int usb_bus_start_enum(struct usb_bus *bus, unsigned port_num) { struct usb_hcd *hcd; int status = -EOPNOTSUPP; /* NOTE: since HNP can't start by grabbing the bus's address0_sem, * boards with root hubs hooked up to internal devices (instead of * just the OTG port) may need more attention to resetting... */ hcd = bus_to_hcd(bus); if (port_num && hcd->driver->start_port_reset) status = hcd->driver->start_port_reset(hcd, port_num); /* allocate hub_wq shortly after (first) root port reset finishes; * it may issue others, until at least 50 msecs have passed. */ if (status == 0) mod_timer(&hcd->rh_timer, jiffies + msecs_to_jiffies(10)); return status; } EXPORT_SYMBOL_GPL(usb_bus_start_enum); #endif /*-------------------------------------------------------------------------*/ /** * usb_hcd_irq - hook IRQs to HCD framework (bus glue) * @irq: the IRQ being raised * @__hcd: pointer to the HCD whose IRQ is being signaled * * If the controller isn't HALTed, calls the driver's irq handler. * Checks whether the controller is now dead. * * Return: %IRQ_HANDLED if the IRQ was handled. %IRQ_NONE otherwise. */ irqreturn_t usb_hcd_irq (int irq, void *__hcd) { struct usb_hcd *hcd = __hcd; irqreturn_t rc; if (unlikely(HCD_DEAD(hcd) || !HCD_HW_ACCESSIBLE(hcd))) rc = IRQ_NONE; else if (hcd->driver->irq(hcd) == IRQ_NONE) rc = IRQ_NONE; else rc = IRQ_HANDLED; return rc; } EXPORT_SYMBOL_GPL(usb_hcd_irq); /*-------------------------------------------------------------------------*/ /* Workqueue routine for when the root-hub has died. */ static void hcd_died_work(struct work_struct *work) { struct usb_hcd *hcd = container_of(work, struct usb_hcd, died_work); static char *env[] = { "ERROR=DEAD", NULL }; /* Notify user space that the host controller has died */ kobject_uevent_env(&hcd->self.root_hub->dev.kobj, KOBJ_OFFLINE, env); } /** * usb_hc_died - report abnormal shutdown of a host controller (bus glue) * @hcd: pointer to the HCD representing the controller * * This is called by bus glue to report a USB host controller that died * while operations may still have been pending. It's called automatically * by the PCI glue, so only glue for non-PCI busses should need to call it. * * Only call this function with the primary HCD. */ void usb_hc_died (struct usb_hcd *hcd) { unsigned long flags; dev_err (hcd->self.controller, "HC died; cleaning up\n"); spin_lock_irqsave (&hcd_root_hub_lock, flags); clear_bit(HCD_FLAG_RH_RUNNING, &hcd->flags); set_bit(HCD_FLAG_DEAD, &hcd->flags); if (hcd->rh_registered) { clear_bit(HCD_FLAG_POLL_RH, &hcd->flags); /* make hub_wq clean up old urbs and devices */ usb_set_device_state (hcd->self.root_hub, USB_STATE_NOTATTACHED); usb_kick_hub_wq(hcd->self.root_hub); } if (usb_hcd_is_primary_hcd(hcd) && hcd->shared_hcd) { hcd = hcd->shared_hcd; clear_bit(HCD_FLAG_RH_RUNNING, &hcd->flags); set_bit(HCD_FLAG_DEAD, &hcd->flags); if (hcd->rh_registered) { clear_bit(HCD_FLAG_POLL_RH, &hcd->flags); /* make hub_wq clean up old urbs and devices */ usb_set_device_state(hcd->self.root_hub, USB_STATE_NOTATTACHED); usb_kick_hub_wq(hcd->self.root_hub); } } /* Handle the case where this function gets called with a shared HCD */ if (usb_hcd_is_primary_hcd(hcd)) schedule_work(&hcd->died_work); else schedule_work(&hcd->primary_hcd->died_work); spin_unlock_irqrestore (&hcd_root_hub_lock, flags); /* Make sure that the other roothub is also deallocated. */ } EXPORT_SYMBOL_GPL (usb_hc_died); /*-------------------------------------------------------------------------*/ static void init_giveback_urb_bh(struct giveback_urb_bh *bh) { spin_lock_init(&bh->lock); INIT_LIST_HEAD(&bh->head); INIT_WORK(&bh->bh, usb_giveback_urb_bh); } struct usb_hcd *__usb_create_hcd(const struct hc_driver *driver, struct device *sysdev, struct device *dev, const char *bus_name, struct usb_hcd *primary_hcd) { struct usb_hcd *hcd; hcd = kzalloc(sizeof(*hcd) + driver->hcd_priv_size, GFP_KERNEL); if (!hcd) return NULL; if (primary_hcd == NULL) { hcd->address0_mutex = kmalloc(sizeof(*hcd->address0_mutex), GFP_KERNEL); if (!hcd->address0_mutex) { kfree(hcd); dev_dbg(dev, "hcd address0 mutex alloc failed\n"); return NULL; } mutex_init(hcd->address0_mutex); hcd->bandwidth_mutex = kmalloc(sizeof(*hcd->bandwidth_mutex), GFP_KERNEL); if (!hcd->bandwidth_mutex) { kfree(hcd->address0_mutex); kfree(hcd); dev_dbg(dev, "hcd bandwidth mutex alloc failed\n"); return NULL; } mutex_init(hcd->bandwidth_mutex); dev_set_drvdata(dev, hcd); } else { mutex_lock(&usb_port_peer_mutex); hcd->address0_mutex = primary_hcd->address0_mutex; hcd->bandwidth_mutex = primary_hcd->bandwidth_mutex; hcd->primary_hcd = primary_hcd; primary_hcd->primary_hcd = primary_hcd; hcd->shared_hcd = primary_hcd; primary_hcd->shared_hcd = hcd; mutex_unlock(&usb_port_peer_mutex); } kref_init(&hcd->kref); usb_bus_init(&hcd->self); hcd->self.controller = dev; hcd->self.sysdev = sysdev; hcd->self.bus_name = bus_name; timer_setup(&hcd->rh_timer, rh_timer_func, 0); #ifdef CONFIG_PM INIT_WORK(&hcd->wakeup_work, hcd_resume_work); #endif INIT_WORK(&hcd->died_work, hcd_died_work); hcd->driver = driver; hcd->speed = driver->flags & HCD_MASK; hcd->product_desc = (driver->product_desc) ? driver->product_desc : "USB Host Controller"; return hcd; } EXPORT_SYMBOL_GPL(__usb_create_hcd); /** * usb_create_shared_hcd - create and initialize an HCD structure * @driver: HC driver that will use this hcd * @dev: device for this HC, stored in hcd->self.controller * @bus_name: value to store in hcd->self.bus_name * @primary_hcd: a pointer to the usb_hcd structure that is sharing the * PCI device. Only allocate certain resources for the primary HCD * * Context: task context, might sleep. * * Allocate a struct usb_hcd, with extra space at the end for the * HC driver's private data. Initialize the generic members of the * hcd structure. * * Return: On success, a pointer to the created and initialized HCD structure. * On failure (e.g. if memory is unavailable), %NULL. */ struct usb_hcd *usb_create_shared_hcd(const struct hc_driver *driver, struct device *dev, const char *bus_name, struct usb_hcd *primary_hcd) { return __usb_create_hcd(driver, dev, dev, bus_name, primary_hcd); } EXPORT_SYMBOL_GPL(usb_create_shared_hcd); /** * usb_create_hcd - create and initialize an HCD structure * @driver: HC driver that will use this hcd * @dev: device for this HC, stored in hcd->self.controller * @bus_name: value to store in hcd->self.bus_name * * Context: task context, might sleep. * * Allocate a struct usb_hcd, with extra space at the end for the * HC driver's private data. Initialize the generic members of the * hcd structure. * * Return: On success, a pointer to the created and initialized HCD * structure. On failure (e.g. if memory is unavailable), %NULL. */ struct usb_hcd *usb_create_hcd(const struct hc_driver *driver, struct device *dev, const char *bus_name) { return __usb_create_hcd(driver, dev, dev, bus_name, NULL); } EXPORT_SYMBOL_GPL(usb_create_hcd); /* * Roothubs that share one PCI device must also share the bandwidth mutex. * Don't deallocate the bandwidth_mutex until the last shared usb_hcd is * deallocated. * * Make sure to deallocate the bandwidth_mutex only when the last HCD is * freed. When hcd_release() is called for either hcd in a peer set, * invalidate the peer's ->shared_hcd and ->primary_hcd pointers. */ static void hcd_release(struct kref *kref) { struct usb_hcd *hcd = container_of (kref, struct usb_hcd, kref); mutex_lock(&usb_port_peer_mutex); if (hcd->shared_hcd) { struct usb_hcd *peer = hcd->shared_hcd; peer->shared_hcd = NULL; peer->primary_hcd = NULL; } else { kfree(hcd->address0_mutex); kfree(hcd->bandwidth_mutex); } mutex_unlock(&usb_port_peer_mutex); kfree(hcd); } struct usb_hcd *usb_get_hcd (struct usb_hcd *hcd) { if (hcd) kref_get (&hcd->kref); return hcd; } EXPORT_SYMBOL_GPL(usb_get_hcd); void usb_put_hcd (struct usb_hcd *hcd) { if (hcd) kref_put (&hcd->kref, hcd_release); } EXPORT_SYMBOL_GPL(usb_put_hcd); int usb_hcd_is_primary_hcd(struct usb_hcd *hcd) { if (!hcd->primary_hcd) return 1; return hcd == hcd->primary_hcd; } EXPORT_SYMBOL_GPL(usb_hcd_is_primary_hcd); int usb_hcd_find_raw_port_number(struct usb_hcd *hcd, int port1) { if (!hcd->driver->find_raw_port_number) return port1; return hcd->driver->find_raw_port_number(hcd, port1); } static int usb_hcd_request_irqs(struct usb_hcd *hcd, unsigned int irqnum, unsigned long irqflags) { int retval; if (hcd->driver->irq) { snprintf(hcd->irq_descr, sizeof(hcd->irq_descr), "%s:usb%d", hcd->driver->description, hcd->self.busnum); retval = request_irq(irqnum, &usb_hcd_irq, irqflags, hcd->irq_descr, hcd); if (retval != 0) { dev_err(hcd->self.controller, "request interrupt %d failed\n", irqnum); return retval; } hcd->irq = irqnum; dev_info(hcd->self.controller, "irq %d, %s 0x%08llx\n", irqnum, (hcd->driver->flags & HCD_MEMORY) ? "io mem" : "io port", (unsigned long long)hcd->rsrc_start); } else { hcd->irq = 0; if (hcd->rsrc_start) dev_info(hcd->self.controller, "%s 0x%08llx\n", (hcd->driver->flags & HCD_MEMORY) ? "io mem" : "io port", (unsigned long long)hcd->rsrc_start); } return 0; } /* * Before we free this root hub, flush in-flight peering attempts * and disable peer lookups */ static void usb_put_invalidate_rhdev(struct usb_hcd *hcd) { struct usb_device *rhdev; mutex_lock(&usb_port_peer_mutex); rhdev = hcd->self.root_hub; hcd->self.root_hub = NULL; mutex_unlock(&usb_port_peer_mutex); usb_put_dev(rhdev); } /** * usb_stop_hcd - Halt the HCD * @hcd: the usb_hcd that has to be halted * * Stop the root-hub polling timer and invoke the HCD's ->stop callback. */ static void usb_stop_hcd(struct usb_hcd *hcd) { hcd->rh_pollable = 0; clear_bit(HCD_FLAG_POLL_RH, &hcd->flags); timer_delete_sync(&hcd->rh_timer); hcd->driver->stop(hcd); hcd->state = HC_STATE_HALT; /* In case the HCD restarted the timer, stop it again. */ clear_bit(HCD_FLAG_POLL_RH, &hcd->flags); timer_delete_sync(&hcd->rh_timer); } /** * usb_add_hcd - finish generic HCD structure initialization and register * @hcd: the usb_hcd structure to initialize * @irqnum: Interrupt line to allocate * @irqflags: Interrupt type flags * * Finish the remaining parts of generic HCD initialization: allocate the * buffers of consistent memory, register the bus, request the IRQ line, * and call the driver's reset() and start() routines. */ int usb_add_hcd(struct usb_hcd *hcd, unsigned int irqnum, unsigned long irqflags) { int retval; struct usb_device *rhdev; struct usb_hcd *shared_hcd; int skip_phy_initialization; if (usb_hcd_is_primary_hcd(hcd)) skip_phy_initialization = hcd->skip_phy_initialization; else skip_phy_initialization = hcd->primary_hcd->skip_phy_initialization; if (!skip_phy_initialization) { if (usb_hcd_is_primary_hcd(hcd)) { hcd->phy_roothub = usb_phy_roothub_alloc(hcd->self.sysdev); if (IS_ERR(hcd->phy_roothub)) return PTR_ERR(hcd->phy_roothub); } else { hcd->phy_roothub = usb_phy_roothub_alloc_usb3_phy(hcd->self.sysdev); if (IS_ERR(hcd->phy_roothub)) return PTR_ERR(hcd->phy_roothub); } retval = usb_phy_roothub_init(hcd->phy_roothub); if (retval) return retval; retval = usb_phy_roothub_set_mode(hcd->phy_roothub, PHY_MODE_USB_HOST_SS); if (retval) retval = usb_phy_roothub_set_mode(hcd->phy_roothub, PHY_MODE_USB_HOST); if (retval) goto err_usb_phy_roothub_power_on; retval = usb_phy_roothub_power_on(hcd->phy_roothub); if (retval) goto err_usb_phy_roothub_power_on; } dev_info(hcd->self.controller, "%s\n", hcd->product_desc); switch (authorized_default) { case USB_AUTHORIZE_NONE: hcd->dev_policy = USB_DEVICE_AUTHORIZE_NONE; break; case USB_AUTHORIZE_INTERNAL: hcd->dev_policy = USB_DEVICE_AUTHORIZE_INTERNAL; break; case USB_AUTHORIZE_ALL: case USB_AUTHORIZE_WIRED: default: hcd->dev_policy = USB_DEVICE_AUTHORIZE_ALL; break; } set_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags); /* per default all interfaces are authorized */ set_bit(HCD_FLAG_INTF_AUTHORIZED, &hcd->flags); /* HC is in reset state, but accessible. Now do the one-time init, * bottom up so that hcds can customize the root hubs before hub_wq * starts talking to them. (Note, bus id is assigned early too.) */ retval = hcd_buffer_create(hcd); if (retval != 0) { dev_dbg(hcd->self.sysdev, "pool alloc failed\n"); goto err_create_buf; } retval = usb_register_bus(&hcd->self); if (retval < 0) goto err_register_bus; rhdev = usb_alloc_dev(NULL, &hcd->self, 0); if (rhdev == NULL) { dev_err(hcd->self.sysdev, "unable to allocate root hub\n"); retval = -ENOMEM; goto err_allocate_root_hub; } mutex_lock(&usb_port_peer_mutex); hcd->self.root_hub = rhdev; mutex_unlock(&usb_port_peer_mutex); rhdev->rx_lanes = 1; rhdev->tx_lanes = 1; rhdev->ssp_rate = USB_SSP_GEN_UNKNOWN; switch (hcd->speed) { case HCD_USB11: rhdev->speed = USB_SPEED_FULL; break; case HCD_USB2: rhdev->speed = USB_SPEED_HIGH; break; case HCD_USB3: rhdev->speed = USB_SPEED_SUPER; break; case HCD_USB32: rhdev->rx_lanes = 2; rhdev->tx_lanes = 2; rhdev->ssp_rate = USB_SSP_GEN_2x2; rhdev->speed = USB_SPEED_SUPER_PLUS; break; case HCD_USB31: rhdev->ssp_rate = USB_SSP_GEN_2x1; rhdev->speed = USB_SPEED_SUPER_PLUS; break; default: retval = -EINVAL; goto err_set_rh_speed; } /* wakeup flag init defaults to "everything works" for root hubs, * but drivers can override it in reset() if needed, along with * recording the overall controller's system wakeup capability. */ device_set_wakeup_capable(&rhdev->dev, 1); /* HCD_FLAG_RH_RUNNING doesn't matter until the root hub is * registered. But since the controller can die at any time, * let's initialize the flag before touching the hardware. */ set_bit(HCD_FLAG_RH_RUNNING, &hcd->flags); /* "reset" is misnamed; its role is now one-time init. the controller * should already have been reset (and boot firmware kicked off etc). */ if (hcd->driver->reset) { retval = hcd->driver->reset(hcd); if (retval < 0) { dev_err(hcd->self.controller, "can't setup: %d\n", retval); goto err_hcd_driver_setup; } } hcd->rh_pollable = 1; retval = usb_phy_roothub_calibrate(hcd->phy_roothub); if (retval) goto err_hcd_driver_setup; /* NOTE: root hub and controller capabilities may not be the same */ if (device_can_wakeup(hcd->self.controller) && device_can_wakeup(&hcd->self.root_hub->dev)) dev_dbg(hcd->self.controller, "supports USB remote wakeup\n"); /* initialize BHs */ init_giveback_urb_bh(&hcd->high_prio_bh); hcd->high_prio_bh.high_prio = true; init_giveback_urb_bh(&hcd->low_prio_bh); /* enable irqs just before we start the controller, * if the BIOS provides legacy PCI irqs. */ if (usb_hcd_is_primary_hcd(hcd) && irqnum) { retval = usb_hcd_request_irqs(hcd, irqnum, irqflags); if (retval) goto err_request_irq; } hcd->state = HC_STATE_RUNNING; retval = hcd->driver->start(hcd); if (retval < 0) { dev_err(hcd->self.controller, "startup error %d\n", retval); goto err_hcd_driver_start; } /* starting here, usbcore will pay attention to the shared HCD roothub */ shared_hcd = hcd->shared_hcd; if (!usb_hcd_is_primary_hcd(hcd) && shared_hcd && HCD_DEFER_RH_REGISTER(shared_hcd)) { retval = register_root_hub(shared_hcd); if (retval != 0) goto err_register_root_hub; if (shared_hcd->uses_new_polling && HCD_POLL_RH(shared_hcd)) usb_hcd_poll_rh_status(shared_hcd); } /* starting here, usbcore will pay attention to this root hub */ if (!HCD_DEFER_RH_REGISTER(hcd)) { retval = register_root_hub(hcd); if (retval != 0) goto err_register_root_hub; if (hcd->uses_new_polling && HCD_POLL_RH(hcd)) usb_hcd_poll_rh_status(hcd); } return retval; err_register_root_hub: usb_stop_hcd(hcd); err_hcd_driver_start: if (usb_hcd_is_primary_hcd(hcd) && hcd->irq > 0) free_irq(irqnum, hcd); err_request_irq: err_hcd_driver_setup: err_set_rh_speed: usb_put_invalidate_rhdev(hcd); err_allocate_root_hub: usb_deregister_bus(&hcd->self); err_register_bus: hcd_buffer_destroy(hcd); err_create_buf: usb_phy_roothub_power_off(hcd->phy_roothub); err_usb_phy_roothub_power_on: usb_phy_roothub_exit(hcd->phy_roothub); return retval; } EXPORT_SYMBOL_GPL(usb_add_hcd); /** * usb_remove_hcd - shutdown processing for generic HCDs * @hcd: the usb_hcd structure to remove * * Context: task context, might sleep. * * Disconnects the root hub, then reverses the effects of usb_add_hcd(), * invoking the HCD's stop() method. */ void usb_remove_hcd(struct usb_hcd *hcd) { struct usb_device *rhdev; bool rh_registered; if (!hcd) { pr_debug("%s: hcd is NULL\n", __func__); return; } rhdev = hcd->self.root_hub; dev_info(hcd->self.controller, "remove, state %x\n", hcd->state); usb_get_dev(rhdev); clear_bit(HCD_FLAG_RH_RUNNING, &hcd->flags); if (HC_IS_RUNNING (hcd->state)) hcd->state = HC_STATE_QUIESCING; dev_dbg(hcd->self.controller, "roothub graceful disconnect\n"); spin_lock_irq (&hcd_root_hub_lock); rh_registered = hcd->rh_registered; hcd->rh_registered = 0; spin_unlock_irq (&hcd_root_hub_lock); #ifdef CONFIG_PM cancel_work_sync(&hcd->wakeup_work); #endif cancel_work_sync(&hcd->died_work); mutex_lock(&usb_bus_idr_lock); if (rh_registered) usb_disconnect(&rhdev); /* Sets rhdev to NULL */ mutex_unlock(&usb_bus_idr_lock); /* * flush_work() isn't needed here because: * - driver's disconnect() called from usb_disconnect() should * make sure its URBs are completed during the disconnect() * callback * * - it is too late to run complete() here since driver may have * been removed already now */ /* Prevent any more root-hub status calls from the timer. * The HCD might still restart the timer (if a port status change * interrupt occurs), but usb_hcd_poll_rh_status() won't invoke * the hub_status_data() callback. */ usb_stop_hcd(hcd); if (usb_hcd_is_primary_hcd(hcd)) { if (hcd->irq > 0) free_irq(hcd->irq, hcd); } usb_deregister_bus(&hcd->self); hcd_buffer_destroy(hcd); usb_phy_roothub_power_off(hcd->phy_roothub); usb_phy_roothub_exit(hcd->phy_roothub); usb_put_invalidate_rhdev(hcd); hcd->flags = 0; } EXPORT_SYMBOL_GPL(usb_remove_hcd); void usb_hcd_platform_shutdown(struct platform_device *dev) { struct usb_hcd *hcd = platform_get_drvdata(dev); /* No need for pm_runtime_put(), we're shutting down */ pm_runtime_get_sync(&dev->dev); if (hcd->driver->shutdown) hcd->driver->shutdown(hcd); } EXPORT_SYMBOL_GPL(usb_hcd_platform_shutdown); int usb_hcd_setup_local_mem(struct usb_hcd *hcd, phys_addr_t phys_addr, dma_addr_t dma, size_t size) { int err; void *local_mem; hcd->localmem_pool = devm_gen_pool_create(hcd->self.sysdev, 4, dev_to_node(hcd->self.sysdev), dev_name(hcd->self.sysdev)); if (IS_ERR(hcd->localmem_pool)) return PTR_ERR(hcd->localmem_pool); /* * if a physical SRAM address was passed, map it, otherwise * allocate system memory as a buffer. */ if (phys_addr) local_mem = devm_memremap(hcd->self.sysdev, phys_addr, size, MEMREMAP_WC); else local_mem = dmam_alloc_attrs(hcd->self.sysdev, size, &dma, GFP_KERNEL, DMA_ATTR_WRITE_COMBINE); if (IS_ERR_OR_NULL(local_mem)) { if (!local_mem) return -ENOMEM; return PTR_ERR(local_mem); } /* * Here we pass a dma_addr_t but the arg type is a phys_addr_t. * It's not backed by system memory and thus there's no kernel mapping * for it. */ err = gen_pool_add_virt(hcd->localmem_pool, (unsigned long)local_mem, dma, size, dev_to_node(hcd->self.sysdev)); if (err < 0) { dev_err(hcd->self.sysdev, "gen_pool_add_virt failed with %d\n", err); return err; } return 0; } EXPORT_SYMBOL_GPL(usb_hcd_setup_local_mem); /*-------------------------------------------------------------------------*/ #if IS_ENABLED(CONFIG_USB_MON) const struct usb_mon_operations *mon_ops; /* * The registration is unlocked. * We do it this way because we do not want to lock in hot paths. * * Notice that the code is minimally error-proof. Because usbmon needs * symbols from usbcore, usbcore gets referenced and cannot be unloaded first. */ int usb_mon_register(const struct usb_mon_operations *ops) { if (mon_ops) return -EBUSY; mon_ops = ops; mb(); return 0; } EXPORT_SYMBOL_GPL (usb_mon_register); void usb_mon_deregister (void) { if (mon_ops == NULL) { printk(KERN_ERR "USB: monitor was not registered\n"); return; } mon_ops = NULL; mb(); } EXPORT_SYMBOL_GPL (usb_mon_deregister); #endif /* CONFIG_USB_MON || CONFIG_USB_MON_MODULE */ |
| 104 104 104 104 104 104 104 104 104 104 104 104 104 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 | // SPDX-License-Identifier: GPL-2.0 /* * driver.c - centralized device driver management * * Copyright (c) 2002-3 Patrick Mochel * Copyright (c) 2002-3 Open Source Development Labs * Copyright (c) 2007 Greg Kroah-Hartman <gregkh@suse.de> * Copyright (c) 2007 Novell Inc. */ #include <linux/device/driver.h> #include <linux/device.h> #include <linux/module.h> #include <linux/errno.h> #include <linux/slab.h> #include <linux/string.h> #include <linux/sysfs.h> #include "base.h" static struct device *next_device(struct klist_iter *i) { struct klist_node *n = klist_next(i); struct device *dev = NULL; struct device_private *dev_prv; if (n) { dev_prv = to_device_private_driver(n); dev = dev_prv->device; } return dev; } /** * driver_set_override() - Helper to set or clear driver override. * @dev: Device to change * @override: Address of string to change (e.g. &device->driver_override); * The contents will be freed and hold newly allocated override. * @s: NUL-terminated string, new driver name to force a match, pass empty * string to clear it ("" or "\n", where the latter is only for sysfs * interface). * @len: length of @s * * Helper to set or clear driver override in a device, intended for the cases * when the driver_override field is allocated by driver/bus code. * * Returns: 0 on success or a negative error code on failure. */ int driver_set_override(struct device *dev, const char **override, const char *s, size_t len) { const char *new, *old; char *cp; if (!override || !s) return -EINVAL; /* * The stored value will be used in sysfs show callback (sysfs_emit()), * which has a length limit of PAGE_SIZE and adds a trailing newline. * Thus we can store one character less to avoid truncation during sysfs * show. */ if (len >= (PAGE_SIZE - 1)) return -EINVAL; /* * Compute the real length of the string in case userspace sends us a * bunch of \0 characters like python likes to do. */ len = strlen(s); if (!len) { /* Empty string passed - clear override */ device_lock(dev); old = *override; *override = NULL; device_unlock(dev); kfree(old); return 0; } cp = strnchr(s, len, '\n'); if (cp) len = cp - s; new = kstrndup(s, len, GFP_KERNEL); if (!new) return -ENOMEM; device_lock(dev); old = *override; if (cp != s) { *override = new; } else { /* "\n" passed - clear override */ kfree(new); *override = NULL; } device_unlock(dev); kfree(old); return 0; } EXPORT_SYMBOL_GPL(driver_set_override); /** * driver_for_each_device - Iterator for devices bound to a driver. * @drv: Driver we're iterating. * @start: Device to begin with * @data: Data to pass to the callback. * @fn: Function to call for each device. * * Iterate over the @drv's list of devices calling @fn for each one. */ int driver_for_each_device(struct device_driver *drv, struct device *start, void *data, device_iter_t fn) { struct klist_iter i; struct device *dev; int error = 0; if (!drv) return -EINVAL; klist_iter_init_node(&drv->p->klist_devices, &i, start ? &start->p->knode_driver : NULL); while (!error && (dev = next_device(&i))) error = fn(dev, data); klist_iter_exit(&i); return error; } EXPORT_SYMBOL_GPL(driver_for_each_device); /** * driver_find_device - device iterator for locating a particular device. * @drv: The device's driver * @start: Device to begin with * @data: Data to pass to match function * @match: Callback function to check device * * This is similar to the driver_for_each_device() function above, but * it returns a reference to a device that is 'found' for later use, as * determined by the @match callback. * * The callback should return 0 if the device doesn't match and non-zero * if it does. If the callback returns non-zero, this function will * return to the caller and not iterate over any more devices. */ struct device *driver_find_device(const struct device_driver *drv, struct device *start, const void *data, device_match_t match) { struct klist_iter i; struct device *dev; if (!drv || !drv->p) return NULL; klist_iter_init_node(&drv->p->klist_devices, &i, (start ? &start->p->knode_driver : NULL)); while ((dev = next_device(&i))) { if (match(dev, data)) { get_device(dev); break; } } klist_iter_exit(&i); return dev; } EXPORT_SYMBOL_GPL(driver_find_device); /** * driver_create_file - create sysfs file for driver. * @drv: driver. * @attr: driver attribute descriptor. */ int driver_create_file(const struct device_driver *drv, const struct driver_attribute *attr) { int error; if (drv) error = sysfs_create_file(&drv->p->kobj, &attr->attr); else error = -EINVAL; return error; } EXPORT_SYMBOL_GPL(driver_create_file); /** * driver_remove_file - remove sysfs file for driver. * @drv: driver. * @attr: driver attribute descriptor. */ void driver_remove_file(const struct device_driver *drv, const struct driver_attribute *attr) { if (drv) sysfs_remove_file(&drv->p->kobj, &attr->attr); } EXPORT_SYMBOL_GPL(driver_remove_file); int driver_add_groups(const struct device_driver *drv, const struct attribute_group **groups) { return sysfs_create_groups(&drv->p->kobj, groups); } void driver_remove_groups(const struct device_driver *drv, const struct attribute_group **groups) { sysfs_remove_groups(&drv->p->kobj, groups); } /** * driver_register - register driver with bus * @drv: driver to register * * We pass off most of the work to the bus_add_driver() call, * since most of the things we have to do deal with the bus * structures. */ int driver_register(struct device_driver *drv) { int ret; struct device_driver *other; if (!bus_is_registered(drv->bus)) { pr_err("Driver '%s' was unable to register with bus_type '%s' because the bus was not initialized.\n", drv->name, drv->bus->name); return -EINVAL; } if ((drv->bus->probe && drv->probe) || (drv->bus->remove && drv->remove) || (drv->bus->shutdown && drv->shutdown)) pr_warn("Driver '%s' needs updating - please use " "bus_type methods\n", drv->name); other = driver_find(drv->name, drv->bus); if (other) { pr_err("Error: Driver '%s' is already registered, " "aborting...\n", drv->name); return -EBUSY; } ret = bus_add_driver(drv); if (ret) return ret; ret = driver_add_groups(drv, drv->groups); if (ret) { bus_remove_driver(drv); return ret; } kobject_uevent(&drv->p->kobj, KOBJ_ADD); deferred_probe_extend_timeout(); return ret; } EXPORT_SYMBOL_GPL(driver_register); /** * driver_unregister - remove driver from system. * @drv: driver. * * Again, we pass off most of the work to the bus-level call. */ void driver_unregister(struct device_driver *drv) { if (!drv || !drv->p) { WARN(1, "Unexpected driver unregister!\n"); return; } driver_remove_groups(drv, drv->groups); bus_remove_driver(drv); } EXPORT_SYMBOL_GPL(driver_unregister); |
| 280 262 2 168 35 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 | /* * Copyright (c) 1982, 1986 Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * Robert Elz at The University of Melbourne. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef _LINUX_QUOTA_ #define _LINUX_QUOTA_ #include <linux/list.h> #include <linux/mutex.h> #include <linux/rwsem.h> #include <linux/spinlock.h> #include <linux/wait.h> #include <linux/percpu_counter.h> #include <linux/dqblk_xfs.h> #include <linux/dqblk_v1.h> #include <linux/dqblk_v2.h> #include <linux/atomic.h> #include <linux/uidgid.h> #include <linux/projid.h> #include <uapi/linux/quota.h> #undef USRQUOTA #undef GRPQUOTA #undef PRJQUOTA enum quota_type { USRQUOTA = 0, /* element used for user quotas */ GRPQUOTA = 1, /* element used for group quotas */ PRJQUOTA = 2, /* element used for project quotas */ }; /* Masks for quota types when used as a bitmask */ #define QTYPE_MASK_USR (1 << USRQUOTA) #define QTYPE_MASK_GRP (1 << GRPQUOTA) #define QTYPE_MASK_PRJ (1 << PRJQUOTA) typedef __kernel_uid32_t qid_t; /* Type in which we store ids in memory */ typedef long long qsize_t; /* Type in which we store sizes */ struct kqid { /* Type in which we store the quota identifier */ union { kuid_t uid; kgid_t gid; kprojid_t projid; }; enum quota_type type; /* USRQUOTA (uid) or GRPQUOTA (gid) or PRJQUOTA (projid) */ }; extern bool qid_eq(struct kqid left, struct kqid right); extern bool qid_lt(struct kqid left, struct kqid right); extern qid_t from_kqid(struct user_namespace *to, struct kqid qid); extern qid_t from_kqid_munged(struct user_namespace *to, struct kqid qid); extern bool qid_valid(struct kqid qid); /** * make_kqid - Map a user-namespace, type, qid tuple into a kqid. * @from: User namespace that the qid is in * @type: The type of quota * @qid: Quota identifier * * Maps a user-namespace, type qid tuple into a kernel internal * kqid, and returns that kqid. * * When there is no mapping defined for the user-namespace, type, * qid tuple an invalid kqid is returned. Callers are expected to * test for and handle invalid kqids being returned. * Invalid kqids may be tested for using qid_valid(). */ static inline struct kqid make_kqid(struct user_namespace *from, enum quota_type type, qid_t qid) { struct kqid kqid; kqid.type = type; switch (type) { case USRQUOTA: kqid.uid = make_kuid(from, qid); break; case GRPQUOTA: kqid.gid = make_kgid(from, qid); break; case PRJQUOTA: kqid.projid = make_kprojid(from, qid); break; default: BUG(); } return kqid; } /** * make_kqid_invalid - Explicitly make an invalid kqid * @type: The type of quota identifier * * Returns an invalid kqid with the specified type. */ static inline struct kqid make_kqid_invalid(enum quota_type type) { struct kqid kqid; kqid.type = type; switch (type) { case USRQUOTA: kqid.uid = INVALID_UID; break; case GRPQUOTA: kqid.gid = INVALID_GID; break; case PRJQUOTA: kqid.projid = INVALID_PROJID; break; default: BUG(); } return kqid; } /** * make_kqid_uid - Make a kqid from a kuid * @uid: The kuid to make the quota identifier from */ static inline struct kqid make_kqid_uid(kuid_t uid) { struct kqid kqid; kqid.type = USRQUOTA; kqid.uid = uid; return kqid; } /** * make_kqid_gid - Make a kqid from a kgid * @gid: The kgid to make the quota identifier from */ static inline struct kqid make_kqid_gid(kgid_t gid) { struct kqid kqid; kqid.type = GRPQUOTA; kqid.gid = gid; return kqid; } /** * make_kqid_projid - Make a kqid from a projid * @projid: The kprojid to make the quota identifier from */ static inline struct kqid make_kqid_projid(kprojid_t projid) { struct kqid kqid; kqid.type = PRJQUOTA; kqid.projid = projid; return kqid; } /** * qid_has_mapping - Report if a qid maps into a user namespace. * @ns: The user namespace to see if a value maps into. * @qid: The kernel internal quota identifier to test. */ static inline bool qid_has_mapping(struct user_namespace *ns, struct kqid qid) { return from_kqid(ns, qid) != (qid_t) -1; } extern spinlock_t dq_data_lock; /* Maximal numbers of writes for quota operation (insert/delete/update) * (over VFS all formats) */ #define DQUOT_INIT_ALLOC max(V1_INIT_ALLOC, V2_INIT_ALLOC) #define DQUOT_INIT_REWRITE max(V1_INIT_REWRITE, V2_INIT_REWRITE) #define DQUOT_DEL_ALLOC max(V1_DEL_ALLOC, V2_DEL_ALLOC) #define DQUOT_DEL_REWRITE max(V1_DEL_REWRITE, V2_DEL_REWRITE) /* * Data for one user/group kept in memory */ struct mem_dqblk { qsize_t dqb_bhardlimit; /* absolute limit on disk blks alloc */ qsize_t dqb_bsoftlimit; /* preferred limit on disk blks */ qsize_t dqb_curspace; /* current used space */ qsize_t dqb_rsvspace; /* current reserved space for delalloc*/ qsize_t dqb_ihardlimit; /* absolute limit on allocated inodes */ qsize_t dqb_isoftlimit; /* preferred inode limit */ qsize_t dqb_curinodes; /* current # allocated inodes */ time64_t dqb_btime; /* time limit for excessive disk use */ time64_t dqb_itime; /* time limit for excessive inode use */ }; /* * Data for one quotafile kept in memory */ struct quota_format_type; struct mem_dqinfo { struct quota_format_type *dqi_format; int dqi_fmt_id; /* Id of the dqi_format - used when turning * quotas on after remount RW */ struct list_head dqi_dirty_list; /* List of dirty dquots [dq_list_lock] */ unsigned long dqi_flags; /* DFQ_ flags [dq_data_lock] */ unsigned int dqi_bgrace; /* Space grace time [dq_data_lock] */ unsigned int dqi_igrace; /* Inode grace time [dq_data_lock] */ qsize_t dqi_max_spc_limit; /* Maximum space limit [static] */ qsize_t dqi_max_ino_limit; /* Maximum inode limit [static] */ void *dqi_priv; }; struct super_block; /* Mask for flags passed to userspace */ #define DQF_GETINFO_MASK (DQF_ROOT_SQUASH | DQF_SYS_FILE) /* Mask for flags modifiable from userspace */ #define DQF_SETINFO_MASK DQF_ROOT_SQUASH enum { DQF_INFO_DIRTY_B = DQF_PRIVATE, }; #define DQF_INFO_DIRTY (1 << DQF_INFO_DIRTY_B) /* Is info dirty? */ extern void mark_info_dirty(struct super_block *sb, int type); static inline int info_dirty(struct mem_dqinfo *info) { return test_bit(DQF_INFO_DIRTY_B, &info->dqi_flags); } enum { DQST_LOOKUPS, DQST_DROPS, DQST_READS, DQST_WRITES, DQST_CACHE_HITS, DQST_ALLOC_DQUOTS, DQST_FREE_DQUOTS, DQST_SYNCS, _DQST_DQSTAT_LAST }; struct dqstats { unsigned long stat[_DQST_DQSTAT_LAST]; struct percpu_counter counter[_DQST_DQSTAT_LAST]; }; extern struct dqstats dqstats; static inline void dqstats_inc(unsigned int type) { percpu_counter_inc(&dqstats.counter[type]); } static inline void dqstats_dec(unsigned int type) { percpu_counter_dec(&dqstats.counter[type]); } #define DQ_MOD_B 0 /* dquot modified since read */ #define DQ_BLKS_B 1 /* uid/gid has been warned about blk limit */ #define DQ_INODES_B 2 /* uid/gid has been warned about inode limit */ #define DQ_FAKE_B 3 /* no limits only usage */ #define DQ_READ_B 4 /* dquot was read into memory */ #define DQ_ACTIVE_B 5 /* dquot is active (dquot_release not called) */ #define DQ_RELEASING_B 6 /* dquot is in releasing_dquots list waiting * to be cleaned up */ #define DQ_LASTSET_B 7 /* Following 6 bits (see QIF_) are reserved\ * for the mask of entries set via SETQUOTA\ * quotactl. They are set under dq_data_lock\ * and the quota format handling dquot can\ * clear them when it sees fit. */ struct dquot { struct hlist_node dq_hash; /* Hash list in memory [dq_list_lock] */ struct list_head dq_inuse; /* List of all quotas [dq_list_lock] */ struct list_head dq_free; /* Free list element [dq_list_lock] */ struct list_head dq_dirty; /* List of dirty dquots [dq_list_lock] */ struct mutex dq_lock; /* dquot IO lock */ spinlock_t dq_dqb_lock; /* Lock protecting dq_dqb changes */ atomic_t dq_count; /* Use count */ struct super_block *dq_sb; /* superblock this applies to */ struct kqid dq_id; /* ID this applies to (uid, gid, projid) */ loff_t dq_off; /* Offset of dquot on disk [dq_lock, stable once set] */ unsigned long dq_flags; /* See DQ_* */ struct mem_dqblk dq_dqb; /* Diskquota usage [dq_dqb_lock] */ }; /* Operations which must be implemented by each quota format */ struct quota_format_ops { int (*check_quota_file)(struct super_block *sb, int type); /* Detect whether file is in our format */ int (*read_file_info)(struct super_block *sb, int type); /* Read main info about file - called on quotaon() */ int (*write_file_info)(struct super_block *sb, int type); /* Write main info about file */ int (*free_file_info)(struct super_block *sb, int type); /* Called on quotaoff() */ int (*read_dqblk)(struct dquot *dquot); /* Read structure for one user */ int (*commit_dqblk)(struct dquot *dquot); /* Write structure for one user */ int (*release_dqblk)(struct dquot *dquot); /* Called when last reference to dquot is being dropped */ int (*get_next_id)(struct super_block *sb, struct kqid *qid); /* Get next ID with existing structure in the quota file */ }; /* Operations working with dquots */ struct dquot_operations { int (*write_dquot) (struct dquot *); /* Ordinary dquot write */ struct dquot *(*alloc_dquot)(struct super_block *, int); /* Allocate memory for new dquot */ void (*destroy_dquot)(struct dquot *); /* Free memory for dquot */ int (*acquire_dquot) (struct dquot *); /* Quota is going to be created on disk */ int (*release_dquot) (struct dquot *); /* Quota is going to be deleted from disk */ int (*mark_dirty) (struct dquot *); /* Dquot is marked dirty */ int (*write_info) (struct super_block *, int); /* Write of quota "superblock" */ /* get reserved quota for delayed alloc, value returned is managed by * quota code only */ qsize_t *(*get_reserved_space) (struct inode *); int (*get_projid) (struct inode *, kprojid_t *);/* Get project ID */ /* Get number of inodes that were charged for a given inode */ int (*get_inode_usage) (struct inode *, qsize_t *); /* Get next ID with active quota structure */ int (*get_next_id) (struct super_block *sb, struct kqid *qid); }; struct path; /* Structure for communicating via ->get_dqblk() & ->set_dqblk() */ struct qc_dqblk { int d_fieldmask; /* mask of fields to change in ->set_dqblk() */ u64 d_spc_hardlimit; /* absolute limit on used space */ u64 d_spc_softlimit; /* preferred limit on used space */ u64 d_ino_hardlimit; /* maximum # allocated inodes */ u64 d_ino_softlimit; /* preferred inode limit */ u64 d_space; /* Space owned by the user */ u64 d_ino_count; /* # inodes owned by the user */ s64 d_ino_timer; /* zero if within inode limits */ /* if not, we refuse service */ s64 d_spc_timer; /* similar to above; for space */ int d_ino_warns; /* # warnings issued wrt num inodes */ int d_spc_warns; /* # warnings issued wrt used space */ u64 d_rt_spc_hardlimit; /* absolute limit on realtime space */ u64 d_rt_spc_softlimit; /* preferred limit on RT space */ u64 d_rt_space; /* realtime space owned */ s64 d_rt_spc_timer; /* similar to above; for RT space */ int d_rt_spc_warns; /* # warnings issued wrt RT space */ }; /* * Field specifiers for ->set_dqblk() in struct qc_dqblk and also for * ->set_info() in struct qc_info */ #define QC_INO_SOFT (1<<0) #define QC_INO_HARD (1<<1) #define QC_SPC_SOFT (1<<2) #define QC_SPC_HARD (1<<3) #define QC_RT_SPC_SOFT (1<<4) #define QC_RT_SPC_HARD (1<<5) #define QC_LIMIT_MASK (QC_INO_SOFT | QC_INO_HARD | QC_SPC_SOFT | QC_SPC_HARD | \ QC_RT_SPC_SOFT | QC_RT_SPC_HARD) #define QC_SPC_TIMER (1<<6) #define QC_INO_TIMER (1<<7) #define QC_RT_SPC_TIMER (1<<8) #define QC_TIMER_MASK (QC_SPC_TIMER | QC_INO_TIMER | QC_RT_SPC_TIMER) #define QC_SPC_WARNS (1<<9) #define QC_INO_WARNS (1<<10) #define QC_RT_SPC_WARNS (1<<11) #define QC_WARNS_MASK (QC_SPC_WARNS | QC_INO_WARNS | QC_RT_SPC_WARNS) #define QC_SPACE (1<<12) #define QC_INO_COUNT (1<<13) #define QC_RT_SPACE (1<<14) #define QC_ACCT_MASK (QC_SPACE | QC_INO_COUNT | QC_RT_SPACE) #define QC_FLAGS (1<<15) #define QCI_SYSFILE (1 << 0) /* Quota file is hidden from userspace */ #define QCI_ROOT_SQUASH (1 << 1) /* Root squash turned on */ #define QCI_ACCT_ENABLED (1 << 2) /* Quota accounting enabled */ #define QCI_LIMITS_ENFORCED (1 << 3) /* Quota limits enforced */ /* Structures for communicating via ->get_state */ struct qc_type_state { unsigned int flags; /* Flags QCI_* */ unsigned int spc_timelimit; /* Time after which space softlimit is * enforced */ unsigned int ino_timelimit; /* Ditto for inode softlimit */ unsigned int rt_spc_timelimit; /* Ditto for real-time space */ unsigned int spc_warnlimit; /* Limit for number of space warnings */ unsigned int ino_warnlimit; /* Ditto for inodes */ unsigned int rt_spc_warnlimit; /* Ditto for real-time space */ unsigned long long ino; /* Inode number of quota file */ blkcnt_t blocks; /* Number of 512-byte blocks in the file */ blkcnt_t nextents; /* Number of extents in the file */ }; struct qc_state { unsigned int s_incoredqs; /* Number of dquots in core */ struct qc_type_state s_state[MAXQUOTAS]; /* Per quota type information */ }; /* Structure for communicating via ->set_info */ struct qc_info { int i_fieldmask; /* mask of fields to change in ->set_info() */ unsigned int i_flags; /* Flags QCI_* */ unsigned int i_spc_timelimit; /* Time after which space softlimit is * enforced */ unsigned int i_ino_timelimit; /* Ditto for inode softlimit */ unsigned int i_rt_spc_timelimit;/* Ditto for real-time space */ unsigned int i_spc_warnlimit; /* Limit for number of space warnings */ unsigned int i_ino_warnlimit; /* Limit for number of inode warnings */ unsigned int i_rt_spc_warnlimit; /* Ditto for real-time space */ }; /* Operations handling requests from userspace */ struct quotactl_ops { int (*quota_on)(struct super_block *, int, int, const struct path *); int (*quota_off)(struct super_block *, int); int (*quota_enable)(struct super_block *, unsigned int); int (*quota_disable)(struct super_block *, unsigned int); int (*quota_sync)(struct super_block *, int); int (*set_info)(struct super_block *, int, struct qc_info *); int (*get_dqblk)(struct super_block *, struct kqid, struct qc_dqblk *); int (*get_nextdqblk)(struct super_block *, struct kqid *, struct qc_dqblk *); int (*set_dqblk)(struct super_block *, struct kqid, struct qc_dqblk *); int (*get_state)(struct super_block *, struct qc_state *); int (*rm_xquota)(struct super_block *, unsigned int); }; struct quota_format_type { int qf_fmt_id; /* Quota format id */ const struct quota_format_ops *qf_ops; /* Operations of format */ struct module *qf_owner; /* Module implementing quota format */ struct quota_format_type *qf_next; }; /** * Quota state flags - they come in three flavors - for users, groups and projects. * * Actual typed flags layout: * USRQUOTA GRPQUOTA PRJQUOTA * DQUOT_USAGE_ENABLED 0x0001 0x0002 0x0004 * DQUOT_LIMITS_ENABLED 0x0008 0x0010 0x0020 * DQUOT_SUSPENDED 0x0040 0x0080 0x0100 * * Following bits are used for non-typed flags: * DQUOT_QUOTA_SYS_FILE 0x0200 * DQUOT_NEGATIVE_USAGE 0x0400 * DQUOT_NOLIST_DIRTY 0x0800 */ enum { _DQUOT_USAGE_ENABLED = 0, /* Track disk usage for users */ _DQUOT_LIMITS_ENABLED, /* Enforce quota limits for users */ _DQUOT_SUSPENDED, /* User diskquotas are off, but * we have necessary info in * memory to turn them on */ _DQUOT_STATE_FLAGS }; #define DQUOT_USAGE_ENABLED (1 << _DQUOT_USAGE_ENABLED * MAXQUOTAS) #define DQUOT_LIMITS_ENABLED (1 << _DQUOT_LIMITS_ENABLED * MAXQUOTAS) #define DQUOT_SUSPENDED (1 << _DQUOT_SUSPENDED * MAXQUOTAS) #define DQUOT_STATE_FLAGS (DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED | \ DQUOT_SUSPENDED) /* Other quota flags */ #define DQUOT_STATE_LAST (_DQUOT_STATE_FLAGS * MAXQUOTAS) #define DQUOT_QUOTA_SYS_FILE (1 << DQUOT_STATE_LAST) /* Quota file is a special * system file and user cannot * touch it. Filesystem is * responsible for setting * S_NOQUOTA, S_NOATIME flags */ #define DQUOT_NEGATIVE_USAGE (1 << (DQUOT_STATE_LAST + 1)) /* Allow negative quota usage */ /* Do not track dirty dquots in a list */ #define DQUOT_NOLIST_DIRTY (1 << (DQUOT_STATE_LAST + 2)) static inline unsigned int dquot_state_flag(unsigned int flags, int type) { return flags << type; } static inline unsigned int dquot_generic_flag(unsigned int flags, int type) { return (flags >> type) & DQUOT_STATE_FLAGS; } /* Bitmap of quota types where flag is set in flags */ static __always_inline unsigned dquot_state_types(unsigned flags, unsigned flag) { BUILD_BUG_ON_NOT_POWER_OF_2(flag); return (flags / flag) & ((1 << MAXQUOTAS) - 1); } #ifdef CONFIG_QUOTA_NETLINK_INTERFACE extern void quota_send_warning(struct kqid qid, dev_t dev, const char warntype); #else static inline void quota_send_warning(struct kqid qid, dev_t dev, const char warntype) { return; } #endif /* CONFIG_QUOTA_NETLINK_INTERFACE */ struct quota_info { unsigned int flags; /* Flags for diskquotas on this device */ struct rw_semaphore dqio_sem; /* Lock quota file while I/O in progress */ struct inode *files[MAXQUOTAS]; /* inodes of quotafiles */ struct mem_dqinfo info[MAXQUOTAS]; /* Information for each quota type */ const struct quota_format_ops *ops[MAXQUOTAS]; /* Operations for each type */ }; void register_quota_format(struct quota_format_type *fmt); void unregister_quota_format(struct quota_format_type *fmt); struct quota_module_name { int qm_fmt_id; char *qm_mod_name; }; #define INIT_QUOTA_MODULE_NAMES {\ {QFMT_VFS_OLD, "quota_v1"},\ {QFMT_VFS_V0, "quota_v2"},\ {QFMT_VFS_V1, "quota_v2"},\ {0, NULL}} #endif /* _QUOTA_ */ |
| 267 5 262 260 176 168 15 15 2 486 486 4 4 486 486 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 | // SPDX-License-Identifier: GPL-2.0-or-later /* * ip_vs_proto_tcp.c: TCP load balancing support for IPVS * * Authors: Wensong Zhang <wensong@linuxvirtualserver.org> * Julian Anastasov <ja@ssi.bg> * * Changes: Hans Schillstrom <hans.schillstrom@ericsson.com> * * Network name space (netns) aware. * Global data moved to netns i.e struct netns_ipvs * tcp_timeouts table has copy per netns in a hash table per * protocol ip_vs_proto_data and is handled by netns */ #define KMSG_COMPONENT "IPVS" #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt #include <linux/kernel.h> #include <linux/ip.h> #include <linux/tcp.h> /* for tcphdr */ #include <net/ip.h> #include <net/tcp.h> /* for csum_tcpudp_magic */ #include <net/ip6_checksum.h> #include <linux/netfilter.h> #include <linux/netfilter_ipv4.h> #include <linux/indirect_call_wrapper.h> #include <net/ip_vs.h> static int tcp_csum_check(int af, struct sk_buff *skb, struct ip_vs_protocol *pp); static int tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, struct ip_vs_proto_data *pd, int *verdict, struct ip_vs_conn **cpp, struct ip_vs_iphdr *iph) { struct ip_vs_service *svc; struct tcphdr _tcph, *th; __be16 _ports[2], *ports = NULL; /* In the event of icmp, we're only guaranteed to have the first 8 * bytes of the transport header, so we only check the rest of the * TCP packet for non-ICMP packets */ if (likely(!ip_vs_iph_icmp(iph))) { th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph); if (th) { if (th->rst || !(sysctl_sloppy_tcp(ipvs) || th->syn)) return 1; ports = &th->source; } } else { ports = skb_header_pointer( skb, iph->len, sizeof(_ports), &_ports); } if (!ports) { *verdict = NF_DROP; return 0; } /* No !th->ack check to allow scheduling on SYN+ACK for Active FTP */ if (likely(!ip_vs_iph_inverse(iph))) svc = ip_vs_service_find(ipvs, af, skb->mark, iph->protocol, &iph->daddr, ports[1]); else svc = ip_vs_service_find(ipvs, af, skb->mark, iph->protocol, &iph->saddr, ports[0]); if (svc) { int ignored; if (ip_vs_todrop(ipvs)) { /* * It seems that we are very loaded. * We have to drop this packet :( */ *verdict = NF_DROP; return 0; } /* * Let the virtual server select a real server for the * incoming connection, and create a connection entry. */ *cpp = ip_vs_schedule(svc, skb, pd, &ignored, iph); if (!*cpp && ignored <= 0) { if (!ignored) *verdict = ip_vs_leave(svc, skb, pd, iph); else *verdict = NF_DROP; return 0; } } /* NF_ACCEPT */ return 1; } static inline void tcp_fast_csum_update(int af, struct tcphdr *tcph, const union nf_inet_addr *oldip, const union nf_inet_addr *newip, __be16 oldport, __be16 newport) { #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) tcph->check = csum_fold(ip_vs_check_diff16(oldip->ip6, newip->ip6, ip_vs_check_diff2(oldport, newport, ~csum_unfold(tcph->check)))); else #endif tcph->check = csum_fold(ip_vs_check_diff4(oldip->ip, newip->ip, ip_vs_check_diff2(oldport, newport, ~csum_unfold(tcph->check)))); } static inline void tcp_partial_csum_update(int af, struct tcphdr *tcph, const union nf_inet_addr *oldip, const union nf_inet_addr *newip, __be16 oldlen, __be16 newlen) { #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) tcph->check = ~csum_fold(ip_vs_check_diff16(oldip->ip6, newip->ip6, ip_vs_check_diff2(oldlen, newlen, csum_unfold(tcph->check)))); else #endif tcph->check = ~csum_fold(ip_vs_check_diff4(oldip->ip, newip->ip, ip_vs_check_diff2(oldlen, newlen, csum_unfold(tcph->check)))); } INDIRECT_CALLABLE_SCOPE int tcp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, struct ip_vs_conn *cp, struct ip_vs_iphdr *iph) { struct tcphdr *tcph; unsigned int tcphoff = iph->len; bool payload_csum = false; int oldlen; #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6 && iph->fragoffs) return 1; #endif oldlen = skb->len - tcphoff; /* csum_check requires unshared skb */ if (skb_ensure_writable(skb, tcphoff + sizeof(*tcph))) return 0; if (unlikely(cp->app != NULL)) { int ret; /* Some checks before mangling */ if (!tcp_csum_check(cp->af, skb, pp)) return 0; /* Call application helper if needed */ if (!(ret = ip_vs_app_pkt_out(cp, skb, iph))) return 0; /* ret=2: csum update is needed after payload mangling */ if (ret == 1) oldlen = skb->len - tcphoff; else payload_csum = true; } tcph = (void *)skb_network_header(skb) + tcphoff; tcph->source = cp->vport; /* Adjust TCP checksums */ if (skb->ip_summed == CHECKSUM_PARTIAL) { tcp_partial_csum_update(cp->af, tcph, &cp->daddr, &cp->vaddr, htons(oldlen), htons(skb->len - tcphoff)); } else if (!payload_csum) { /* Only port and addr are changed, do fast csum update */ tcp_fast_csum_update(cp->af, tcph, &cp->daddr, &cp->vaddr, cp->dport, cp->vport); if (skb->ip_summed == CHECKSUM_COMPLETE) skb->ip_summed = cp->app ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE; } else { /* full checksum calculation */ tcph->check = 0; skb->csum = skb_checksum(skb, tcphoff, skb->len - tcphoff, 0); #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6) tcph->check = csum_ipv6_magic(&cp->vaddr.in6, &cp->caddr.in6, skb->len - tcphoff, cp->protocol, skb->csum); else #endif tcph->check = csum_tcpudp_magic(cp->vaddr.ip, cp->caddr.ip, skb->len - tcphoff, cp->protocol, skb->csum); skb->ip_summed = CHECKSUM_UNNECESSARY; IP_VS_DBG(11, "O-pkt: %s O-csum=%d (+%zd)\n", pp->name, tcph->check, (char*)&(tcph->check) - (char*)tcph); } return 1; } static int tcp_dnat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, struct ip_vs_conn *cp, struct ip_vs_iphdr *iph) { struct tcphdr *tcph; unsigned int tcphoff = iph->len; bool payload_csum = false; int oldlen; #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6 && iph->fragoffs) return 1; #endif oldlen = skb->len - tcphoff; /* csum_check requires unshared skb */ if (skb_ensure_writable(skb, tcphoff + sizeof(*tcph))) return 0; if (unlikely(cp->app != NULL)) { int ret; /* Some checks before mangling */ if (!tcp_csum_check(cp->af, skb, pp)) return 0; /* * Attempt ip_vs_app call. * It will fix ip_vs_conn and iph ack_seq stuff */ if (!(ret = ip_vs_app_pkt_in(cp, skb, iph))) return 0; /* ret=2: csum update is needed after payload mangling */ if (ret == 1) oldlen = skb->len - tcphoff; else payload_csum = true; } tcph = (void *)skb_network_header(skb) + tcphoff; tcph->dest = cp->dport; /* * Adjust TCP checksums */ if (skb->ip_summed == CHECKSUM_PARTIAL) { tcp_partial_csum_update(cp->af, tcph, &cp->vaddr, &cp->daddr, htons(oldlen), htons(skb->len - tcphoff)); } else if (!payload_csum) { /* Only port and addr are changed, do fast csum update */ tcp_fast_csum_update(cp->af, tcph, &cp->vaddr, &cp->daddr, cp->vport, cp->dport); if (skb->ip_summed == CHECKSUM_COMPLETE) skb->ip_summed = cp->app ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE; } else { /* full checksum calculation */ tcph->check = 0; skb->csum = skb_checksum(skb, tcphoff, skb->len - tcphoff, 0); #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6) tcph->check = csum_ipv6_magic(&cp->caddr.in6, &cp->daddr.in6, skb->len - tcphoff, cp->protocol, skb->csum); else #endif tcph->check = csum_tcpudp_magic(cp->caddr.ip, cp->daddr.ip, skb->len - tcphoff, cp->protocol, skb->csum); skb->ip_summed = CHECKSUM_UNNECESSARY; } return 1; } static int tcp_csum_check(int af, struct sk_buff *skb, struct ip_vs_protocol *pp) { unsigned int tcphoff; #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) tcphoff = sizeof(struct ipv6hdr); else #endif tcphoff = ip_hdrlen(skb); switch (skb->ip_summed) { case CHECKSUM_NONE: skb->csum = skb_checksum(skb, tcphoff, skb->len - tcphoff, 0); fallthrough; case CHECKSUM_COMPLETE: #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) { if (csum_ipv6_magic(&ipv6_hdr(skb)->saddr, &ipv6_hdr(skb)->daddr, skb->len - tcphoff, ipv6_hdr(skb)->nexthdr, skb->csum)) { IP_VS_DBG_RL_PKT(0, af, pp, skb, 0, "Failed checksum for"); return 0; } } else #endif if (csum_tcpudp_magic(ip_hdr(skb)->saddr, ip_hdr(skb)->daddr, skb->len - tcphoff, ip_hdr(skb)->protocol, skb->csum)) { IP_VS_DBG_RL_PKT(0, af, pp, skb, 0, "Failed checksum for"); return 0; } break; default: /* No need to checksum. */ break; } return 1; } #define TCP_DIR_INPUT 0 #define TCP_DIR_OUTPUT 4 #define TCP_DIR_INPUT_ONLY 8 static const int tcp_state_off[IP_VS_DIR_LAST] = { [IP_VS_DIR_INPUT] = TCP_DIR_INPUT, [IP_VS_DIR_OUTPUT] = TCP_DIR_OUTPUT, [IP_VS_DIR_INPUT_ONLY] = TCP_DIR_INPUT_ONLY, }; /* * Timeout table[state] */ static const int tcp_timeouts[IP_VS_TCP_S_LAST+1] = { [IP_VS_TCP_S_NONE] = 2*HZ, [IP_VS_TCP_S_ESTABLISHED] = 15*60*HZ, [IP_VS_TCP_S_SYN_SENT] = 2*60*HZ, [IP_VS_TCP_S_SYN_RECV] = 1*60*HZ, [IP_VS_TCP_S_FIN_WAIT] = 2*60*HZ, [IP_VS_TCP_S_TIME_WAIT] = 2*60*HZ, [IP_VS_TCP_S_CLOSE] = 10*HZ, [IP_VS_TCP_S_CLOSE_WAIT] = 60*HZ, [IP_VS_TCP_S_LAST_ACK] = 30*HZ, [IP_VS_TCP_S_LISTEN] = 2*60*HZ, [IP_VS_TCP_S_SYNACK] = 120*HZ, [IP_VS_TCP_S_LAST] = 2*HZ, }; static const char *const tcp_state_name_table[IP_VS_TCP_S_LAST+1] = { [IP_VS_TCP_S_NONE] = "NONE", [IP_VS_TCP_S_ESTABLISHED] = "ESTABLISHED", [IP_VS_TCP_S_SYN_SENT] = "SYN_SENT", [IP_VS_TCP_S_SYN_RECV] = "SYN_RECV", [IP_VS_TCP_S_FIN_WAIT] = "FIN_WAIT", [IP_VS_TCP_S_TIME_WAIT] = "TIME_WAIT", [IP_VS_TCP_S_CLOSE] = "CLOSE", [IP_VS_TCP_S_CLOSE_WAIT] = "CLOSE_WAIT", [IP_VS_TCP_S_LAST_ACK] = "LAST_ACK", [IP_VS_TCP_S_LISTEN] = "LISTEN", [IP_VS_TCP_S_SYNACK] = "SYNACK", [IP_VS_TCP_S_LAST] = "BUG!", }; static const bool tcp_state_active_table[IP_VS_TCP_S_LAST] = { [IP_VS_TCP_S_NONE] = false, [IP_VS_TCP_S_ESTABLISHED] = true, [IP_VS_TCP_S_SYN_SENT] = true, [IP_VS_TCP_S_SYN_RECV] = true, [IP_VS_TCP_S_FIN_WAIT] = false, [IP_VS_TCP_S_TIME_WAIT] = false, [IP_VS_TCP_S_CLOSE] = false, [IP_VS_TCP_S_CLOSE_WAIT] = false, [IP_VS_TCP_S_LAST_ACK] = false, [IP_VS_TCP_S_LISTEN] = false, [IP_VS_TCP_S_SYNACK] = true, }; #define sNO IP_VS_TCP_S_NONE #define sES IP_VS_TCP_S_ESTABLISHED #define sSS IP_VS_TCP_S_SYN_SENT #define sSR IP_VS_TCP_S_SYN_RECV #define sFW IP_VS_TCP_S_FIN_WAIT #define sTW IP_VS_TCP_S_TIME_WAIT #define sCL IP_VS_TCP_S_CLOSE #define sCW IP_VS_TCP_S_CLOSE_WAIT #define sLA IP_VS_TCP_S_LAST_ACK #define sLI IP_VS_TCP_S_LISTEN #define sSA IP_VS_TCP_S_SYNACK struct tcp_states_t { int next_state[IP_VS_TCP_S_LAST]; }; static const char * tcp_state_name(int state) { if (state >= IP_VS_TCP_S_LAST) return "ERR!"; return tcp_state_name_table[state] ? tcp_state_name_table[state] : "?"; } static bool tcp_state_active(int state) { if (state >= IP_VS_TCP_S_LAST) return false; return tcp_state_active_table[state]; } static struct tcp_states_t tcp_states[] = { /* INPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }}, /*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sSR }}, /* OUTPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSS, sES, sSS, sSR, sSS, sSS, sSS, sSS, sSS, sLI, sSR }}, /*fin*/ {{sTW, sFW, sSS, sTW, sFW, sTW, sCL, sTW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sLA, sES, sES }}, /*rst*/ {{sCL, sCL, sSS, sCL, sCL, sTW, sCL, sCL, sCL, sCL, sCL }}, /* INPUT-ONLY */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }}, /*fin*/ {{sCL, sFW, sSS, sTW, sFW, sTW, sCL, sCW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }}, }; static struct tcp_states_t tcp_states_dos[] = { /* INPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSA }}, /*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sSA }}, /*ack*/ {{sES, sES, sSS, sSR, sFW, sTW, sCL, sCW, sCL, sLI, sSA }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }}, /* OUTPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSS, sES, sSS, sSA, sSS, sSS, sSS, sSS, sSS, sLI, sSA }}, /*fin*/ {{sTW, sFW, sSS, sTW, sFW, sTW, sCL, sTW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sLA, sES, sES }}, /*rst*/ {{sCL, sCL, sSS, sCL, sCL, sTW, sCL, sCL, sCL, sCL, sCL }}, /* INPUT-ONLY */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSA, sES, sES, sSR, sSA, sSA, sSA, sSA, sSA, sSA, sSA }}, /*fin*/ {{sCL, sFW, sSS, sTW, sFW, sTW, sCL, sCW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }}, }; static void tcp_timeout_change(struct ip_vs_proto_data *pd, int flags) { int on = (flags & 1); /* secure_tcp */ /* ** FIXME: change secure_tcp to independent sysctl var ** or make it per-service or per-app because it is valid ** for most if not for all of the applications. Something ** like "capabilities" (flags) for each object. */ pd->tcp_state_table = (on ? tcp_states_dos : tcp_states); } static inline int tcp_state_idx(struct tcphdr *th) { if (th->rst) return 3; if (th->syn) return 0; if (th->fin) return 1; if (th->ack) return 2; return -1; } static inline void set_tcp_state(struct ip_vs_proto_data *pd, struct ip_vs_conn *cp, int direction, struct tcphdr *th) { int state_idx; int new_state = IP_VS_TCP_S_CLOSE; int state_off = tcp_state_off[direction]; /* * Update state offset to INPUT_ONLY if necessary * or delete NO_OUTPUT flag if output packet detected */ if (cp->flags & IP_VS_CONN_F_NOOUTPUT) { if (state_off == TCP_DIR_OUTPUT) cp->flags &= ~IP_VS_CONN_F_NOOUTPUT; else state_off = TCP_DIR_INPUT_ONLY; } if ((state_idx = tcp_state_idx(th)) < 0) { IP_VS_DBG(8, "tcp_state_idx=%d!!!\n", state_idx); goto tcp_state_out; } new_state = pd->tcp_state_table[state_off+state_idx].next_state[cp->state]; tcp_state_out: if (new_state != cp->state) { struct ip_vs_dest *dest = cp->dest; IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] c:%s:%d v:%s:%d " "d:%s:%d state: %s->%s conn->refcnt:%d\n", pd->pp->name, ((state_off == TCP_DIR_OUTPUT) ? "output " : "input "), th->syn ? 'S' : '.', th->fin ? 'F' : '.', th->ack ? 'A' : '.', th->rst ? 'R' : '.', IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport), IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport), IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport), tcp_state_name(cp->state), tcp_state_name(new_state), refcount_read(&cp->refcnt)); if (dest) { if (!(cp->flags & IP_VS_CONN_F_INACTIVE) && !tcp_state_active(new_state)) { atomic_dec(&dest->activeconns); atomic_inc(&dest->inactconns); cp->flags |= IP_VS_CONN_F_INACTIVE; } else if ((cp->flags & IP_VS_CONN_F_INACTIVE) && tcp_state_active(new_state)) { atomic_inc(&dest->activeconns); atomic_dec(&dest->inactconns); cp->flags &= ~IP_VS_CONN_F_INACTIVE; } } if (new_state == IP_VS_TCP_S_ESTABLISHED) ip_vs_control_assure_ct(cp); } if (likely(pd)) cp->timeout = pd->timeout_table[cp->state = new_state]; else /* What to do ? */ cp->timeout = tcp_timeouts[cp->state = new_state]; } /* * Handle state transitions */ static void tcp_state_transition(struct ip_vs_conn *cp, int direction, const struct sk_buff *skb, struct ip_vs_proto_data *pd) { struct tcphdr _tcph, *th; #ifdef CONFIG_IP_VS_IPV6 int ihl = cp->af == AF_INET ? ip_hdrlen(skb) : sizeof(struct ipv6hdr); #else int ihl = ip_hdrlen(skb); #endif th = skb_header_pointer(skb, ihl, sizeof(_tcph), &_tcph); if (th == NULL) return; spin_lock_bh(&cp->lock); set_tcp_state(pd, cp, direction, th); spin_unlock_bh(&cp->lock); } static inline __u16 tcp_app_hashkey(__be16 port) { return (((__force u16)port >> TCP_APP_TAB_BITS) ^ (__force u16)port) & TCP_APP_TAB_MASK; } static int tcp_register_app(struct netns_ipvs *ipvs, struct ip_vs_app *inc) { struct ip_vs_app *i; __u16 hash; __be16 port = inc->port; int ret = 0; struct ip_vs_proto_data *pd = ip_vs_proto_data_get(ipvs, IPPROTO_TCP); hash = tcp_app_hashkey(port); list_for_each_entry(i, &ipvs->tcp_apps[hash], p_list) { if (i->port == port) { ret = -EEXIST; goto out; } } list_add_rcu(&inc->p_list, &ipvs->tcp_apps[hash]); atomic_inc(&pd->appcnt); out: return ret; } static void tcp_unregister_app(struct netns_ipvs *ipvs, struct ip_vs_app *inc) { struct ip_vs_proto_data *pd = ip_vs_proto_data_get(ipvs, IPPROTO_TCP); atomic_dec(&pd->appcnt); list_del_rcu(&inc->p_list); } static int tcp_app_conn_bind(struct ip_vs_conn *cp) { struct netns_ipvs *ipvs = cp->ipvs; int hash; struct ip_vs_app *inc; int result = 0; /* Default binding: bind app only for NAT */ if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ) return 0; /* Lookup application incarnations and bind the right one */ hash = tcp_app_hashkey(cp->vport); list_for_each_entry_rcu(inc, &ipvs->tcp_apps[hash], p_list) { if (inc->port == cp->vport) { if (unlikely(!ip_vs_app_inc_get(inc))) break; IP_VS_DBG_BUF(9, "%s(): Binding conn %s:%u->" "%s:%u to app %s on port %u\n", __func__, IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport), IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport), inc->name, ntohs(inc->port)); cp->app = inc; if (inc->init_conn) result = inc->init_conn(inc, cp); break; } } return result; } /* * Set LISTEN timeout. (ip_vs_conn_put will setup timer) */ void ip_vs_tcp_conn_listen(struct ip_vs_conn *cp) { struct ip_vs_proto_data *pd = ip_vs_proto_data_get(cp->ipvs, IPPROTO_TCP); spin_lock_bh(&cp->lock); cp->state = IP_VS_TCP_S_LISTEN; cp->timeout = (pd ? pd->timeout_table[IP_VS_TCP_S_LISTEN] : tcp_timeouts[IP_VS_TCP_S_LISTEN]); spin_unlock_bh(&cp->lock); } /* --------------------------------------------- * timeouts is netns related now. * --------------------------------------------- */ static int __ip_vs_tcp_init(struct netns_ipvs *ipvs, struct ip_vs_proto_data *pd) { ip_vs_init_hash_table(ipvs->tcp_apps, TCP_APP_TAB_SIZE); pd->timeout_table = ip_vs_create_timeout_table((int *)tcp_timeouts, sizeof(tcp_timeouts)); if (!pd->timeout_table) return -ENOMEM; pd->tcp_state_table = tcp_states; return 0; } static void __ip_vs_tcp_exit(struct netns_ipvs *ipvs, struct ip_vs_proto_data *pd) { kfree(pd->timeout_table); } struct ip_vs_protocol ip_vs_protocol_tcp = { .name = "TCP", .protocol = IPPROTO_TCP, .num_states = IP_VS_TCP_S_LAST, .dont_defrag = 0, .init = NULL, .exit = NULL, .init_netns = __ip_vs_tcp_init, .exit_netns = __ip_vs_tcp_exit, .register_app = tcp_register_app, .unregister_app = tcp_unregister_app, .conn_schedule = tcp_conn_schedule, .conn_in_get = ip_vs_conn_in_get_proto, .conn_out_get = ip_vs_conn_out_get_proto, .snat_handler = tcp_snat_handler, .dnat_handler = tcp_dnat_handler, .state_name = tcp_state_name, .state_transition = tcp_state_transition, .app_conn_bind = tcp_app_conn_bind, .debug_packet = ip_vs_tcpudp_debug_packet, .timeout_change = tcp_timeout_change, }; |
| 2298 2585 1909 823 825 1175 1175 768 694 128 365 322 551 69 814 816 816 683 75 57 54 77 77 77 2035 1278 1279 90 1216 2035 1666 68 769 1279 1002 720 1123 80 501 189 26 30 267 474 39 37 2154 2140 1248 10 658 669 1200 675 2148 2149 1229 2138 2139 2138 2139 1282 124 734 2138 2135 2141 2139 2138 2139 1790 688 688 2139 2138 1701 1700 1700 1702 1068 2361 2363 2365 2204 500 1081 1649 1736 1893 8 3 2364 780 283 713 1505 1486 76 9 13 67 67 4 63 31 57 36 34 59 44 32 1 778 710 59 702 60 3 33 11 58 1 26 33 33 67 57 45 32 2599 2600 2596 612 160 486 1506 1633 288 126 383 383 622 647 331 843 144 292 187 356 1316 1653 430 2 75 28 49 1670 1505 1663 1663 1662 1663 1728 1728 359 9 10 1 8 2299 696 1 40 18 40 679 1 1 1 181 181 91 90 91 90 91 91 91 91 88 40 91 90 639 344 2 3 240 660 16 17 671 123 85 175 150 60 175 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 | // SPDX-License-Identifier: GPL-2.0 /* * fs/ext4/extents_status.c * * Written by Yongqiang Yang <xiaoqiangnk@gmail.com> * Modified by * Allison Henderson <achender@linux.vnet.ibm.com> * Hugh Dickins <hughd@google.com> * Zheng Liu <wenqing.lz@taobao.com> * * Ext4 extents status tree core functions. */ #include <linux/list_sort.h> #include <linux/proc_fs.h> #include <linux/seq_file.h> #include "ext4.h" #include <trace/events/ext4.h> /* * According to previous discussion in Ext4 Developer Workshop, we * will introduce a new structure called io tree to track all extent * status in order to solve some problems that we have met * (e.g. Reservation space warning), and provide extent-level locking. * Delay extent tree is the first step to achieve this goal. It is * original built by Yongqiang Yang. At that time it is called delay * extent tree, whose goal is only track delayed extents in memory to * simplify the implementation of fiemap and bigalloc, and introduce * lseek SEEK_DATA/SEEK_HOLE support. That is why it is still called * delay extent tree at the first commit. But for better understand * what it does, it has been rename to extent status tree. * * Step1: * Currently the first step has been done. All delayed extents are * tracked in the tree. It maintains the delayed extent when a delayed * allocation is issued, and the delayed extent is written out or * invalidated. Therefore the implementation of fiemap and bigalloc * are simplified, and SEEK_DATA/SEEK_HOLE are introduced. * * The following comment describes the implemenmtation of extent * status tree and future works. * * Step2: * In this step all extent status are tracked by extent status tree. * Thus, we can first try to lookup a block mapping in this tree before * finding it in extent tree. Hence, single extent cache can be removed * because extent status tree can do a better job. Extents in status * tree are loaded on-demand. Therefore, the extent status tree may not * contain all of the extents in a file. Meanwhile we define a shrinker * to reclaim memory from extent status tree because fragmented extent * tree will make status tree cost too much memory. written/unwritten/- * hole extents in the tree will be reclaimed by this shrinker when we * are under high memory pressure. Delayed extents will not be * reclimed because fiemap, bigalloc, and seek_data/hole need it. */ /* * Extent status tree implementation for ext4. * * * ========================================================================== * Extent status tree tracks all extent status. * * 1. Why we need to implement extent status tree? * * Without extent status tree, ext4 identifies a delayed extent by looking * up page cache, this has several deficiencies - complicated, buggy, * and inefficient code. * * FIEMAP, SEEK_HOLE/DATA, bigalloc, and writeout all need to know if a * block or a range of blocks are belonged to a delayed extent. * * Let us have a look at how they do without extent status tree. * -- FIEMAP * FIEMAP looks up page cache to identify delayed allocations from holes. * * -- SEEK_HOLE/DATA * SEEK_HOLE/DATA has the same problem as FIEMAP. * * -- bigalloc * bigalloc looks up page cache to figure out if a block is * already under delayed allocation or not to determine whether * quota reserving is needed for the cluster. * * -- writeout * Writeout looks up whole page cache to see if a buffer is * mapped, If there are not very many delayed buffers, then it is * time consuming. * * With extent status tree implementation, FIEMAP, SEEK_HOLE/DATA, * bigalloc and writeout can figure out if a block or a range of * blocks is under delayed allocation(belonged to a delayed extent) or * not by searching the extent tree. * * * ========================================================================== * 2. Ext4 extent status tree impelmentation * * -- extent * A extent is a range of blocks which are contiguous logically and * physically. Unlike extent in extent tree, this extent in ext4 is * a in-memory struct, there is no corresponding on-disk data. There * is no limit on length of extent, so an extent can contain as many * blocks as they are contiguous logically and physically. * * -- extent status tree * Every inode has an extent status tree and all allocation blocks * are added to the tree with different status. The extent in the * tree are ordered by logical block no. * * -- operations on a extent status tree * There are three important operations on a delayed extent tree: find * next extent, adding a extent(a range of blocks) and removing a extent. * * -- race on a extent status tree * Extent status tree is protected by inode->i_es_lock. * * -- memory consumption * Fragmented extent tree will make extent status tree cost too much * memory. Hence, we will reclaim written/unwritten/hole extents from * the tree under a heavy memory pressure. * * ========================================================================== * 3. Assurance of Ext4 extent status tree consistency * * When mapping blocks, Ext4 queries the extent status tree first and should * always trusts that the extent status tree is consistent and up to date. * Therefore, it is important to adheres to the following rules when createing, * modifying and removing extents. * * 1. Besides fastcommit replay, when Ext4 creates or queries block mappings, * the extent information should always be processed through the extent * status tree instead of being organized manually through the on-disk * extent tree. * * 2. When updating the extent tree, Ext4 should acquire the i_data_sem * exclusively and update the extent status tree atomically. If the extents * to be modified are large enough to exceed the range that a single * i_data_sem can process (as ext4_datasem_ensure_credits() may drop * i_data_sem to restart a transaction), it must (e.g. as ext4_punch_hole() * does): * * a) Hold the i_rwsem and invalidate_lock exclusively. This ensures * exclusion against page faults, as well as reads and writes that may * concurrently modify the extent status tree. * b) Evict all page cache in the affected range and recommend rebuilding * or dropping the extent status tree after modifying the on-disk * extent tree. This ensures exclusion against concurrent writebacks * that do not hold those locks but only holds a folio lock. * * 3. Based on the rules above, when querying block mappings, Ext4 should at * least hold the i_rwsem or invalidate_lock or folio lock(s) for the * specified querying range. * * ========================================================================== * 4. Performance analysis * * -- overhead * 1. There is a cache extent for write access, so if writes are * not very random, adding space operaions are in O(1) time. * * -- gain * 2. Code is much simpler, more readable, more maintainable and * more efficient. * * * ========================================================================== * 5. TODO list * * -- Refactor delayed space reservation * * -- Extent-level locking */ static struct kmem_cache *ext4_es_cachep; static struct kmem_cache *ext4_pending_cachep; static int __es_insert_extent(struct inode *inode, struct extent_status *newes, struct extent_status *prealloc); static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t end, int *reserved, struct extent_status *prealloc); static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan); static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan, struct ext4_inode_info *locked_ei); static int __revise_pending(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, struct pending_reservation **prealloc); int __init ext4_init_es(void) { ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT); if (ext4_es_cachep == NULL) return -ENOMEM; return 0; } void ext4_exit_es(void) { kmem_cache_destroy(ext4_es_cachep); } void ext4_es_init_tree(struct ext4_es_tree *tree) { tree->root = RB_ROOT; tree->cache_es = NULL; } #ifdef ES_DEBUG__ static void ext4_es_print_tree(struct inode *inode) { struct ext4_es_tree *tree; struct rb_node *node; printk(KERN_DEBUG "status extents for inode %lu:", inode->i_ino); tree = &EXT4_I(inode)->i_es_tree; node = rb_first(&tree->root); while (node) { struct extent_status *es; es = rb_entry(node, struct extent_status, rb_node); printk(KERN_DEBUG " [%u/%u) %llu %x", es->es_lblk, es->es_len, ext4_es_pblock(es), ext4_es_status(es)); node = rb_next(node); } printk(KERN_DEBUG "\n"); } #else #define ext4_es_print_tree(inode) #endif static inline ext4_lblk_t ext4_es_end(struct extent_status *es) { BUG_ON(es->es_lblk + es->es_len < es->es_lblk); return es->es_lblk + es->es_len - 1; } /* * search through the tree for an delayed extent with a given offset. If * it can't be found, try to find next extent. */ static struct extent_status *__es_tree_search(struct rb_root *root, ext4_lblk_t lblk) { struct rb_node *node = root->rb_node; struct extent_status *es = NULL; while (node) { es = rb_entry(node, struct extent_status, rb_node); if (lblk < es->es_lblk) node = node->rb_left; else if (lblk > ext4_es_end(es)) node = node->rb_right; else return es; } if (es && lblk < es->es_lblk) return es; if (es && lblk > ext4_es_end(es)) { node = rb_next(&es->rb_node); return node ? rb_entry(node, struct extent_status, rb_node) : NULL; } return NULL; } /* * ext4_es_find_extent_range - find extent with specified status within block * range or next extent following block range in * extents status tree * * @inode - file containing the range * @matching_fn - pointer to function that matches extents with desired status * @lblk - logical block defining start of range * @end - logical block defining end of range * @es - extent found, if any * * Find the first extent within the block range specified by @lblk and @end * in the extents status tree that satisfies @matching_fn. If a match * is found, it's returned in @es. If not, and a matching extent is found * beyond the block range, it's returned in @es. If no match is found, an * extent is returned in @es whose es_lblk, es_len, and es_pblk components * are 0. */ static void __es_find_extent_range(struct inode *inode, int (*matching_fn)(struct extent_status *es), ext4_lblk_t lblk, ext4_lblk_t end, struct extent_status *es) { struct ext4_es_tree *tree = NULL; struct extent_status *es1 = NULL; struct rb_node *node; WARN_ON(es == NULL); WARN_ON(end < lblk); tree = &EXT4_I(inode)->i_es_tree; /* see if the extent has been cached */ es->es_lblk = es->es_len = es->es_pblk = 0; es1 = READ_ONCE(tree->cache_es); if (es1 && in_range(lblk, es1->es_lblk, es1->es_len)) { es_debug("%u cached by [%u/%u) %llu %x\n", lblk, es1->es_lblk, es1->es_len, ext4_es_pblock(es1), ext4_es_status(es1)); goto out; } es1 = __es_tree_search(&tree->root, lblk); out: if (es1 && !matching_fn(es1)) { while ((node = rb_next(&es1->rb_node)) != NULL) { es1 = rb_entry(node, struct extent_status, rb_node); if (es1->es_lblk > end) { es1 = NULL; break; } if (matching_fn(es1)) break; } } if (es1 && matching_fn(es1)) { WRITE_ONCE(tree->cache_es, es1); es->es_lblk = es1->es_lblk; es->es_len = es1->es_len; es->es_pblk = es1->es_pblk; } } /* * Locking for __es_find_extent_range() for external use */ void ext4_es_find_extent_range(struct inode *inode, int (*matching_fn)(struct extent_status *es), ext4_lblk_t lblk, ext4_lblk_t end, struct extent_status *es) { es->es_lblk = es->es_len = es->es_pblk = 0; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return; trace_ext4_es_find_extent_range_enter(inode, lblk); read_lock(&EXT4_I(inode)->i_es_lock); __es_find_extent_range(inode, matching_fn, lblk, end, es); read_unlock(&EXT4_I(inode)->i_es_lock); trace_ext4_es_find_extent_range_exit(inode, es); } /* * __es_scan_range - search block range for block with specified status * in extents status tree * * @inode - file containing the range * @matching_fn - pointer to function that matches extents with desired status * @lblk - logical block defining start of range * @end - logical block defining end of range * * Returns true if at least one block in the specified block range satisfies * the criterion specified by @matching_fn, and false if not. If at least * one extent has the specified status, then there is at least one block * in the cluster with that status. Should only be called by code that has * taken i_es_lock. */ static bool __es_scan_range(struct inode *inode, int (*matching_fn)(struct extent_status *es), ext4_lblk_t start, ext4_lblk_t end) { struct extent_status es; __es_find_extent_range(inode, matching_fn, start, end, &es); if (es.es_len == 0) return false; /* no matching extent in the tree */ else if (es.es_lblk <= start && start < es.es_lblk + es.es_len) return true; else if (start <= es.es_lblk && es.es_lblk <= end) return true; else return false; } /* * Locking for __es_scan_range() for external use */ bool ext4_es_scan_range(struct inode *inode, int (*matching_fn)(struct extent_status *es), ext4_lblk_t lblk, ext4_lblk_t end) { bool ret; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return false; read_lock(&EXT4_I(inode)->i_es_lock); ret = __es_scan_range(inode, matching_fn, lblk, end); read_unlock(&EXT4_I(inode)->i_es_lock); return ret; } /* * __es_scan_clu - search cluster for block with specified status in * extents status tree * * @inode - file containing the cluster * @matching_fn - pointer to function that matches extents with desired status * @lblk - logical block in cluster to be searched * * Returns true if at least one extent in the cluster containing @lblk * satisfies the criterion specified by @matching_fn, and false if not. If at * least one extent has the specified status, then there is at least one block * in the cluster with that status. Should only be called by code that has * taken i_es_lock. */ static bool __es_scan_clu(struct inode *inode, int (*matching_fn)(struct extent_status *es), ext4_lblk_t lblk) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); ext4_lblk_t lblk_start, lblk_end; lblk_start = EXT4_LBLK_CMASK(sbi, lblk); lblk_end = lblk_start + sbi->s_cluster_ratio - 1; return __es_scan_range(inode, matching_fn, lblk_start, lblk_end); } /* * Locking for __es_scan_clu() for external use */ bool ext4_es_scan_clu(struct inode *inode, int (*matching_fn)(struct extent_status *es), ext4_lblk_t lblk) { bool ret; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return false; read_lock(&EXT4_I(inode)->i_es_lock); ret = __es_scan_clu(inode, matching_fn, lblk); read_unlock(&EXT4_I(inode)->i_es_lock); return ret; } static void ext4_es_list_add(struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); if (!list_empty(&ei->i_es_list)) return; spin_lock(&sbi->s_es_lock); if (list_empty(&ei->i_es_list)) { list_add_tail(&ei->i_es_list, &sbi->s_es_list); sbi->s_es_nr_inode++; } spin_unlock(&sbi->s_es_lock); } static void ext4_es_list_del(struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); spin_lock(&sbi->s_es_lock); if (!list_empty(&ei->i_es_list)) { list_del_init(&ei->i_es_list); sbi->s_es_nr_inode--; WARN_ON_ONCE(sbi->s_es_nr_inode < 0); } spin_unlock(&sbi->s_es_lock); } static inline struct pending_reservation *__alloc_pending(bool nofail) { if (!nofail) return kmem_cache_alloc(ext4_pending_cachep, GFP_ATOMIC); return kmem_cache_zalloc(ext4_pending_cachep, GFP_KERNEL | __GFP_NOFAIL); } static inline void __free_pending(struct pending_reservation *pr) { kmem_cache_free(ext4_pending_cachep, pr); } /* * Returns true if we cannot fail to allocate memory for this extent_status * entry and cannot reclaim it until its status changes. */ static inline bool ext4_es_must_keep(struct extent_status *es) { /* fiemap, bigalloc, and seek_data/hole need to use it. */ if (ext4_es_is_delayed(es)) return true; return false; } static inline struct extent_status *__es_alloc_extent(bool nofail) { if (!nofail) return kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC); return kmem_cache_zalloc(ext4_es_cachep, GFP_KERNEL | __GFP_NOFAIL); } static void ext4_es_init_extent(struct inode *inode, struct extent_status *es, ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk) { es->es_lblk = lblk; es->es_len = len; es->es_pblk = pblk; /* We never try to reclaim a must kept extent, so we don't count it. */ if (!ext4_es_must_keep(es)) { if (!EXT4_I(inode)->i_es_shk_nr++) ext4_es_list_add(inode); percpu_counter_inc(&EXT4_SB(inode->i_sb)-> s_es_stats.es_stats_shk_cnt); } EXT4_I(inode)->i_es_all_nr++; percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt); } static inline void __es_free_extent(struct extent_status *es) { kmem_cache_free(ext4_es_cachep, es); } static void ext4_es_free_extent(struct inode *inode, struct extent_status *es) { EXT4_I(inode)->i_es_all_nr--; percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt); /* Decrease the shrink counter when we can reclaim the extent. */ if (!ext4_es_must_keep(es)) { BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0); if (!--EXT4_I(inode)->i_es_shk_nr) ext4_es_list_del(inode); percpu_counter_dec(&EXT4_SB(inode->i_sb)-> s_es_stats.es_stats_shk_cnt); } __es_free_extent(es); } /* * Check whether or not two extents can be merged * Condition: * - logical block number is contiguous * - physical block number is contiguous * - status is equal */ static int ext4_es_can_be_merged(struct extent_status *es1, struct extent_status *es2) { if (ext4_es_type(es1) != ext4_es_type(es2)) return 0; if (((__u64) es1->es_len) + es2->es_len > EXT_MAX_BLOCKS) { pr_warn("ES assertion failed when merging extents. " "The sum of lengths of es1 (%d) and es2 (%d) " "is bigger than allowed file size (%d)\n", es1->es_len, es2->es_len, EXT_MAX_BLOCKS); WARN_ON(1); return 0; } if (((__u64) es1->es_lblk) + es1->es_len != es2->es_lblk) return 0; if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) && (ext4_es_pblock(es1) + es1->es_len == ext4_es_pblock(es2))) return 1; if (ext4_es_is_hole(es1)) return 1; /* we need to check delayed extent */ if (ext4_es_is_delayed(es1)) return 1; return 0; } static struct extent_status * ext4_es_try_to_merge_left(struct inode *inode, struct extent_status *es) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct extent_status *es1; struct rb_node *node; node = rb_prev(&es->rb_node); if (!node) return es; es1 = rb_entry(node, struct extent_status, rb_node); if (ext4_es_can_be_merged(es1, es)) { es1->es_len += es->es_len; if (ext4_es_is_referenced(es)) ext4_es_set_referenced(es1); rb_erase(&es->rb_node, &tree->root); ext4_es_free_extent(inode, es); es = es1; } return es; } static struct extent_status * ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct extent_status *es1; struct rb_node *node; node = rb_next(&es->rb_node); if (!node) return es; es1 = rb_entry(node, struct extent_status, rb_node); if (ext4_es_can_be_merged(es, es1)) { es->es_len += es1->es_len; if (ext4_es_is_referenced(es1)) ext4_es_set_referenced(es); rb_erase(node, &tree->root); ext4_es_free_extent(inode, es1); } return es; } #ifdef ES_AGGRESSIVE_TEST #include "ext4_extents.h" /* Needed when ES_AGGRESSIVE_TEST is defined */ static void ext4_es_insert_extent_ext_check(struct inode *inode, struct extent_status *es) { struct ext4_ext_path *path = NULL; struct ext4_extent *ex; ext4_lblk_t ee_block; ext4_fsblk_t ee_start; unsigned short ee_len; int depth, ee_status, es_status; path = ext4_find_extent(inode, es->es_lblk, NULL, EXT4_EX_NOCACHE); if (IS_ERR(path)) return; depth = ext_depth(inode); ex = path[depth].p_ext; if (ex) { ee_block = le32_to_cpu(ex->ee_block); ee_start = ext4_ext_pblock(ex); ee_len = ext4_ext_get_actual_len(ex); ee_status = ext4_ext_is_unwritten(ex) ? 1 : 0; es_status = ext4_es_is_unwritten(es) ? 1 : 0; /* * Make sure ex and es are not overlap when we try to insert * a delayed/hole extent. */ if (!ext4_es_is_written(es) && !ext4_es_is_unwritten(es)) { if (in_range(es->es_lblk, ee_block, ee_len)) { pr_warn("ES insert assertion failed for " "inode: %lu we can find an extent " "at block [%d/%d/%llu/%c], but we " "want to add a delayed/hole extent " "[%d/%d/%llu/%x]\n", inode->i_ino, ee_block, ee_len, ee_start, ee_status ? 'u' : 'w', es->es_lblk, es->es_len, ext4_es_pblock(es), ext4_es_status(es)); } goto out; } /* * We don't check ee_block == es->es_lblk, etc. because es * might be a part of whole extent, vice versa. */ if (es->es_lblk < ee_block || ext4_es_pblock(es) != ee_start + es->es_lblk - ee_block) { pr_warn("ES insert assertion failed for inode: %lu " "ex_status [%d/%d/%llu/%c] != " "es_status [%d/%d/%llu/%c]\n", inode->i_ino, ee_block, ee_len, ee_start, ee_status ? 'u' : 'w', es->es_lblk, es->es_len, ext4_es_pblock(es), es_status ? 'u' : 'w'); goto out; } if (ee_status ^ es_status) { pr_warn("ES insert assertion failed for inode: %lu " "ex_status [%d/%d/%llu/%c] != " "es_status [%d/%d/%llu/%c]\n", inode->i_ino, ee_block, ee_len, ee_start, ee_status ? 'u' : 'w', es->es_lblk, es->es_len, ext4_es_pblock(es), es_status ? 'u' : 'w'); } } else { /* * We can't find an extent on disk. So we need to make sure * that we don't want to add an written/unwritten extent. */ if (!ext4_es_is_delayed(es) && !ext4_es_is_hole(es)) { pr_warn("ES insert assertion failed for inode: %lu " "can't find an extent at block %d but we want " "to add a written/unwritten extent " "[%d/%d/%llu/%x]\n", inode->i_ino, es->es_lblk, es->es_lblk, es->es_len, ext4_es_pblock(es), ext4_es_status(es)); } } out: ext4_free_ext_path(path); } static void ext4_es_insert_extent_ind_check(struct inode *inode, struct extent_status *es) { struct ext4_map_blocks map; int retval; /* * Here we call ext4_ind_map_blocks to lookup a block mapping because * 'Indirect' structure is defined in indirect.c. So we couldn't * access direct/indirect tree from outside. It is too dirty to define * this function in indirect.c file. */ map.m_lblk = es->es_lblk; map.m_len = es->es_len; retval = ext4_ind_map_blocks(NULL, inode, &map, 0); if (retval > 0) { if (ext4_es_is_delayed(es) || ext4_es_is_hole(es)) { /* * We want to add a delayed/hole extent but this * block has been allocated. */ pr_warn("ES insert assertion failed for inode: %lu " "We can find blocks but we want to add a " "delayed/hole extent [%d/%d/%llu/%x]\n", inode->i_ino, es->es_lblk, es->es_len, ext4_es_pblock(es), ext4_es_status(es)); return; } else if (ext4_es_is_written(es)) { if (retval != es->es_len) { pr_warn("ES insert assertion failed for " "inode: %lu retval %d != es_len %d\n", inode->i_ino, retval, es->es_len); return; } if (map.m_pblk != ext4_es_pblock(es)) { pr_warn("ES insert assertion failed for " "inode: %lu m_pblk %llu != " "es_pblk %llu\n", inode->i_ino, map.m_pblk, ext4_es_pblock(es)); return; } } else { /* * We don't need to check unwritten extent because * indirect-based file doesn't have it. */ BUG(); } } else if (retval == 0) { if (ext4_es_is_written(es)) { pr_warn("ES insert assertion failed for inode: %lu " "We can't find the block but we want to add " "a written extent [%d/%d/%llu/%x]\n", inode->i_ino, es->es_lblk, es->es_len, ext4_es_pblock(es), ext4_es_status(es)); return; } } } static inline void ext4_es_insert_extent_check(struct inode *inode, struct extent_status *es) { /* * We don't need to worry about the race condition because * caller takes i_data_sem locking. */ BUG_ON(!rwsem_is_locked(&EXT4_I(inode)->i_data_sem)); if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) ext4_es_insert_extent_ext_check(inode, es); else ext4_es_insert_extent_ind_check(inode, es); } #else static inline void ext4_es_insert_extent_check(struct inode *inode, struct extent_status *es) { } #endif static int __es_insert_extent(struct inode *inode, struct extent_status *newes, struct extent_status *prealloc) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct rb_node **p = &tree->root.rb_node; struct rb_node *parent = NULL; struct extent_status *es; while (*p) { parent = *p; es = rb_entry(parent, struct extent_status, rb_node); if (newes->es_lblk < es->es_lblk) { if (ext4_es_can_be_merged(newes, es)) { /* * Here we can modify es_lblk directly * because it isn't overlapped. */ es->es_lblk = newes->es_lblk; es->es_len += newes->es_len; if (ext4_es_is_written(es) || ext4_es_is_unwritten(es)) ext4_es_store_pblock(es, newes->es_pblk); es = ext4_es_try_to_merge_left(inode, es); goto out; } p = &(*p)->rb_left; } else if (newes->es_lblk > ext4_es_end(es)) { if (ext4_es_can_be_merged(es, newes)) { es->es_len += newes->es_len; es = ext4_es_try_to_merge_right(inode, es); goto out; } p = &(*p)->rb_right; } else { BUG(); return -EINVAL; } } if (prealloc) es = prealloc; else es = __es_alloc_extent(false); if (!es) return -ENOMEM; ext4_es_init_extent(inode, es, newes->es_lblk, newes->es_len, newes->es_pblk); rb_link_node(&es->rb_node, parent, p); rb_insert_color(&es->rb_node, &tree->root); out: tree->cache_es = es; return 0; } /* * ext4_es_insert_extent() adds information to an inode's extent * status tree. */ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk, unsigned int status, bool delalloc_reserve_used) { struct extent_status newes; ext4_lblk_t end = lblk + len - 1; int err1 = 0, err2 = 0, err3 = 0; int resv_used = 0, pending = 0; struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct extent_status *es1 = NULL; struct extent_status *es2 = NULL; struct pending_reservation *pr = NULL; bool revise_pending = false; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return; es_debug("add [%u/%u) %llu %x %d to extent status tree of inode %lu\n", lblk, len, pblk, status, delalloc_reserve_used, inode->i_ino); if (!len) return; BUG_ON(end < lblk); WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED); newes.es_lblk = lblk; newes.es_len = len; ext4_es_store_pblock_status(&newes, pblk, status); trace_ext4_es_insert_extent(inode, &newes); ext4_es_insert_extent_check(inode, &newes); revise_pending = sbi->s_cluster_ratio > 1 && test_opt(inode->i_sb, DELALLOC) && (status & (EXTENT_STATUS_WRITTEN | EXTENT_STATUS_UNWRITTEN)); retry: if (err1 && !es1) es1 = __es_alloc_extent(true); if ((err1 || err2) && !es2) es2 = __es_alloc_extent(true); if ((err1 || err2 || err3 < 0) && revise_pending && !pr) pr = __alloc_pending(true); write_lock(&EXT4_I(inode)->i_es_lock); err1 = __es_remove_extent(inode, lblk, end, &resv_used, es1); if (err1 != 0) goto error; /* Free preallocated extent if it didn't get used. */ if (es1) { if (!es1->es_len) __es_free_extent(es1); es1 = NULL; } err2 = __es_insert_extent(inode, &newes, es2); if (err2 == -ENOMEM && !ext4_es_must_keep(&newes)) err2 = 0; if (err2 != 0) goto error; /* Free preallocated extent if it didn't get used. */ if (es2) { if (!es2->es_len) __es_free_extent(es2); es2 = NULL; } if (revise_pending) { err3 = __revise_pending(inode, lblk, len, &pr); if (err3 < 0) goto error; if (pr) { __free_pending(pr); pr = NULL; } pending = err3; } error: write_unlock(&EXT4_I(inode)->i_es_lock); /* * Reduce the reserved cluster count to reflect successful deferred * allocation of delayed allocated clusters or direct allocation of * clusters discovered to be delayed allocated. Once allocated, a * cluster is not included in the reserved count. * * When direct allocating (from fallocate, filemap, DIO, or clusters * allocated when delalloc has been disabled by ext4_nonda_switch()) * an extent either 1) contains delayed blocks but start with * non-delayed allocated blocks (e.g. hole) or 2) contains non-delayed * allocated blocks which belong to delayed allocated clusters when * bigalloc feature is enabled, quota has already been claimed by * ext4_mb_new_blocks(), so release the quota reservations made for * any previously delayed allocated clusters instead of claim them * again. */ resv_used += pending; if (resv_used) ext4_da_update_reserve_space(inode, resv_used, delalloc_reserve_used); if (err1 || err2 || err3 < 0) goto retry; ext4_es_print_tree(inode); return; } /* * ext4_es_cache_extent() inserts information into the extent status * tree if and only if there isn't information about the range in * question already. */ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk, unsigned int status) { struct extent_status *es; struct extent_status newes; ext4_lblk_t end = lblk + len - 1; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return; newes.es_lblk = lblk; newes.es_len = len; ext4_es_store_pblock_status(&newes, pblk, status); trace_ext4_es_cache_extent(inode, &newes); if (!len) return; BUG_ON(end < lblk); write_lock(&EXT4_I(inode)->i_es_lock); es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk); if (!es || es->es_lblk > end) __es_insert_extent(inode, &newes, NULL); write_unlock(&EXT4_I(inode)->i_es_lock); } /* * ext4_es_lookup_extent() looks up an extent in extent status tree. * * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks. * * Return: 1 on found, 0 on not */ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t *next_lblk, struct extent_status *es) { struct ext4_es_tree *tree; struct ext4_es_stats *stats; struct extent_status *es1 = NULL; struct rb_node *node; int found = 0; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return 0; trace_ext4_es_lookup_extent_enter(inode, lblk); es_debug("lookup extent in block %u\n", lblk); tree = &EXT4_I(inode)->i_es_tree; read_lock(&EXT4_I(inode)->i_es_lock); /* find extent in cache firstly */ es->es_lblk = es->es_len = es->es_pblk = 0; es1 = READ_ONCE(tree->cache_es); if (es1 && in_range(lblk, es1->es_lblk, es1->es_len)) { es_debug("%u cached by [%u/%u)\n", lblk, es1->es_lblk, es1->es_len); found = 1; goto out; } node = tree->root.rb_node; while (node) { es1 = rb_entry(node, struct extent_status, rb_node); if (lblk < es1->es_lblk) node = node->rb_left; else if (lblk > ext4_es_end(es1)) node = node->rb_right; else { found = 1; break; } } out: stats = &EXT4_SB(inode->i_sb)->s_es_stats; if (found) { BUG_ON(!es1); es->es_lblk = es1->es_lblk; es->es_len = es1->es_len; es->es_pblk = es1->es_pblk; if (!ext4_es_is_referenced(es1)) ext4_es_set_referenced(es1); percpu_counter_inc(&stats->es_stats_cache_hits); if (next_lblk) { node = rb_next(&es1->rb_node); if (node) { es1 = rb_entry(node, struct extent_status, rb_node); *next_lblk = es1->es_lblk; } else *next_lblk = 0; } } else { percpu_counter_inc(&stats->es_stats_cache_misses); } read_unlock(&EXT4_I(inode)->i_es_lock); trace_ext4_es_lookup_extent_exit(inode, es, found); return found; } struct rsvd_count { int ndelayed; bool first_do_lblk_found; ext4_lblk_t first_do_lblk; ext4_lblk_t last_do_lblk; struct extent_status *left_es; bool partial; ext4_lblk_t lclu; }; /* * init_rsvd - initialize reserved count data before removing block range * in file from extent status tree * * @inode - file containing range * @lblk - first block in range * @es - pointer to first extent in range * @rc - pointer to reserved count data * * Assumes es is not NULL */ static void init_rsvd(struct inode *inode, ext4_lblk_t lblk, struct extent_status *es, struct rsvd_count *rc) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct rb_node *node; rc->ndelayed = 0; /* * for bigalloc, note the first delayed block in the range has not * been found, record the extent containing the block to the left of * the region to be removed, if any, and note that there's no partial * cluster to track */ if (sbi->s_cluster_ratio > 1) { rc->first_do_lblk_found = false; if (lblk > es->es_lblk) { rc->left_es = es; } else { node = rb_prev(&es->rb_node); rc->left_es = node ? rb_entry(node, struct extent_status, rb_node) : NULL; } rc->partial = false; } } /* * count_rsvd - count the clusters containing delayed blocks in a range * within an extent and add to the running tally in rsvd_count * * @inode - file containing extent * @lblk - first block in range * @len - length of range in blocks * @es - pointer to extent containing clusters to be counted * @rc - pointer to reserved count data * * Tracks partial clusters found at the beginning and end of extents so * they aren't overcounted when they span adjacent extents */ static void count_rsvd(struct inode *inode, ext4_lblk_t lblk, long len, struct extent_status *es, struct rsvd_count *rc) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); ext4_lblk_t i, end, nclu; if (!ext4_es_is_delayed(es)) return; WARN_ON(len <= 0); if (sbi->s_cluster_ratio == 1) { rc->ndelayed += (int) len; return; } /* bigalloc */ i = (lblk < es->es_lblk) ? es->es_lblk : lblk; end = lblk + (ext4_lblk_t) len - 1; end = (end > ext4_es_end(es)) ? ext4_es_end(es) : end; /* record the first block of the first delayed extent seen */ if (!rc->first_do_lblk_found) { rc->first_do_lblk = i; rc->first_do_lblk_found = true; } /* update the last lblk in the region seen so far */ rc->last_do_lblk = end; /* * if we're tracking a partial cluster and the current extent * doesn't start with it, count it and stop tracking */ if (rc->partial && (rc->lclu != EXT4_B2C(sbi, i))) { rc->ndelayed++; rc->partial = false; } /* * if the first cluster doesn't start on a cluster boundary but * ends on one, count it */ if (EXT4_LBLK_COFF(sbi, i) != 0) { if (end >= EXT4_LBLK_CFILL(sbi, i)) { rc->ndelayed++; rc->partial = false; i = EXT4_LBLK_CFILL(sbi, i) + 1; } } /* * if the current cluster starts on a cluster boundary, count the * number of whole delayed clusters in the extent */ if ((i + sbi->s_cluster_ratio - 1) <= end) { nclu = (end - i + 1) >> sbi->s_cluster_bits; rc->ndelayed += nclu; i += nclu << sbi->s_cluster_bits; } /* * start tracking a partial cluster if there's a partial at the end * of the current extent and we're not already tracking one */ if (!rc->partial && i <= end) { rc->partial = true; rc->lclu = EXT4_B2C(sbi, i); } } /* * __pr_tree_search - search for a pending cluster reservation * * @root - root of pending reservation tree * @lclu - logical cluster to search for * * Returns the pending reservation for the cluster identified by @lclu * if found. If not, returns a reservation for the next cluster if any, * and if not, returns NULL. */ static struct pending_reservation *__pr_tree_search(struct rb_root *root, ext4_lblk_t lclu) { struct rb_node *node = root->rb_node; struct pending_reservation *pr = NULL; while (node) { pr = rb_entry(node, struct pending_reservation, rb_node); if (lclu < pr->lclu) node = node->rb_left; else if (lclu > pr->lclu) node = node->rb_right; else return pr; } if (pr && lclu < pr->lclu) return pr; if (pr && lclu > pr->lclu) { node = rb_next(&pr->rb_node); return node ? rb_entry(node, struct pending_reservation, rb_node) : NULL; } return NULL; } /* * get_rsvd - calculates and returns the number of cluster reservations to be * released when removing a block range from the extent status tree * and releases any pending reservations within the range * * @inode - file containing block range * @end - last block in range * @right_es - pointer to extent containing next block beyond end or NULL * @rc - pointer to reserved count data * * The number of reservations to be released is equal to the number of * clusters containing delayed blocks within the range, minus the number of * clusters still containing delayed blocks at the ends of the range, and * minus the number of pending reservations within the range. */ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end, struct extent_status *right_es, struct rsvd_count *rc) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct pending_reservation *pr; struct ext4_pending_tree *tree = &EXT4_I(inode)->i_pending_tree; struct rb_node *node; ext4_lblk_t first_lclu, last_lclu; bool left_delayed, right_delayed, count_pending; struct extent_status *es; if (sbi->s_cluster_ratio > 1) { /* count any remaining partial cluster */ if (rc->partial) rc->ndelayed++; if (rc->ndelayed == 0) return 0; first_lclu = EXT4_B2C(sbi, rc->first_do_lblk); last_lclu = EXT4_B2C(sbi, rc->last_do_lblk); /* * decrease the delayed count by the number of clusters at the * ends of the range that still contain delayed blocks - * these clusters still need to be reserved */ left_delayed = right_delayed = false; es = rc->left_es; while (es && ext4_es_end(es) >= EXT4_LBLK_CMASK(sbi, rc->first_do_lblk)) { if (ext4_es_is_delayed(es)) { rc->ndelayed--; left_delayed = true; break; } node = rb_prev(&es->rb_node); if (!node) break; es = rb_entry(node, struct extent_status, rb_node); } if (right_es && (!left_delayed || first_lclu != last_lclu)) { if (end < ext4_es_end(right_es)) { es = right_es; } else { node = rb_next(&right_es->rb_node); es = node ? rb_entry(node, struct extent_status, rb_node) : NULL; } while (es && es->es_lblk <= EXT4_LBLK_CFILL(sbi, rc->last_do_lblk)) { if (ext4_es_is_delayed(es)) { rc->ndelayed--; right_delayed = true; break; } node = rb_next(&es->rb_node); if (!node) break; es = rb_entry(node, struct extent_status, rb_node); } } /* * Determine the block range that should be searched for * pending reservations, if any. Clusters on the ends of the * original removed range containing delayed blocks are * excluded. They've already been accounted for and it's not * possible to determine if an associated pending reservation * should be released with the information available in the * extents status tree. */ if (first_lclu == last_lclu) { if (left_delayed | right_delayed) count_pending = false; else count_pending = true; } else { if (left_delayed) first_lclu++; if (right_delayed) last_lclu--; if (first_lclu <= last_lclu) count_pending = true; else count_pending = false; } /* * a pending reservation found between first_lclu and last_lclu * represents an allocated cluster that contained at least one * delayed block, so the delayed total must be reduced by one * for each pending reservation found and released */ if (count_pending) { pr = __pr_tree_search(&tree->root, first_lclu); while (pr && pr->lclu <= last_lclu) { rc->ndelayed--; node = rb_next(&pr->rb_node); rb_erase(&pr->rb_node, &tree->root); __free_pending(pr); if (!node) break; pr = rb_entry(node, struct pending_reservation, rb_node); } } } return rc->ndelayed; } /* * __es_remove_extent - removes block range from extent status tree * * @inode - file containing range * @lblk - first block in range * @end - last block in range * @reserved - number of cluster reservations released * @prealloc - pre-allocated es to avoid memory allocation failures * * If @reserved is not NULL and delayed allocation is enabled, counts * block/cluster reservations freed by removing range and if bigalloc * enabled cancels pending reservations as needed. Returns 0 on success, * error code on failure. */ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t end, int *reserved, struct extent_status *prealloc) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct rb_node *node; struct extent_status *es; struct extent_status orig_es; ext4_lblk_t len1, len2; ext4_fsblk_t block; int err = 0; bool count_reserved = true; struct rsvd_count rc; if (reserved == NULL || !test_opt(inode->i_sb, DELALLOC)) count_reserved = false; es = __es_tree_search(&tree->root, lblk); if (!es) goto out; if (es->es_lblk > end) goto out; /* Simply invalidate cache_es. */ tree->cache_es = NULL; if (count_reserved) init_rsvd(inode, lblk, es, &rc); orig_es.es_lblk = es->es_lblk; orig_es.es_len = es->es_len; orig_es.es_pblk = es->es_pblk; len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0; len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0; if (len1 > 0) es->es_len = len1; if (len2 > 0) { if (len1 > 0) { struct extent_status newes; newes.es_lblk = end + 1; newes.es_len = len2; block = 0x7FDEADBEEFULL; if (ext4_es_is_written(&orig_es) || ext4_es_is_unwritten(&orig_es)) block = ext4_es_pblock(&orig_es) + orig_es.es_len - len2; ext4_es_store_pblock_status(&newes, block, ext4_es_status(&orig_es)); err = __es_insert_extent(inode, &newes, prealloc); if (err) { if (!ext4_es_must_keep(&newes)) return 0; es->es_lblk = orig_es.es_lblk; es->es_len = orig_es.es_len; goto out; } } else { es->es_lblk = end + 1; es->es_len = len2; if (ext4_es_is_written(es) || ext4_es_is_unwritten(es)) { block = orig_es.es_pblk + orig_es.es_len - len2; ext4_es_store_pblock(es, block); } } if (count_reserved) count_rsvd(inode, orig_es.es_lblk + len1, orig_es.es_len - len1 - len2, &orig_es, &rc); goto out_get_reserved; } if (len1 > 0) { if (count_reserved) count_rsvd(inode, lblk, orig_es.es_len - len1, &orig_es, &rc); node = rb_next(&es->rb_node); if (node) es = rb_entry(node, struct extent_status, rb_node); else es = NULL; } while (es && ext4_es_end(es) <= end) { if (count_reserved) count_rsvd(inode, es->es_lblk, es->es_len, es, &rc); node = rb_next(&es->rb_node); rb_erase(&es->rb_node, &tree->root); ext4_es_free_extent(inode, es); if (!node) { es = NULL; break; } es = rb_entry(node, struct extent_status, rb_node); } if (es && es->es_lblk < end + 1) { ext4_lblk_t orig_len = es->es_len; len1 = ext4_es_end(es) - end; if (count_reserved) count_rsvd(inode, es->es_lblk, orig_len - len1, es, &rc); es->es_lblk = end + 1; es->es_len = len1; if (ext4_es_is_written(es) || ext4_es_is_unwritten(es)) { block = es->es_pblk + orig_len - len1; ext4_es_store_pblock(es, block); } } out_get_reserved: if (count_reserved) *reserved = get_rsvd(inode, end, es, &rc); out: return err; } /* * ext4_es_remove_extent - removes block range from extent status tree * * @inode - file containing range * @lblk - first block in range * @len - number of blocks to remove * * Reduces block/cluster reservation count and for bigalloc cancels pending * reservations as needed. */ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len) { ext4_lblk_t end; int err = 0; int reserved = 0; struct extent_status *es = NULL; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return; trace_ext4_es_remove_extent(inode, lblk, len); es_debug("remove [%u/%u) from extent status tree of inode %lu\n", lblk, len, inode->i_ino); if (!len) return; end = lblk + len - 1; BUG_ON(end < lblk); retry: if (err && !es) es = __es_alloc_extent(true); /* * ext4_clear_inode() depends on us taking i_es_lock unconditionally * so that we are sure __es_shrink() is done with the inode before it * is reclaimed. */ write_lock(&EXT4_I(inode)->i_es_lock); err = __es_remove_extent(inode, lblk, end, &reserved, es); /* Free preallocated extent if it didn't get used. */ if (es) { if (!es->es_len) __es_free_extent(es); es = NULL; } write_unlock(&EXT4_I(inode)->i_es_lock); if (err) goto retry; ext4_es_print_tree(inode); ext4_da_release_space(inode, reserved); } static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan, struct ext4_inode_info *locked_ei) { struct ext4_inode_info *ei; struct ext4_es_stats *es_stats; ktime_t start_time; u64 scan_time; int nr_to_walk; int nr_shrunk = 0; int retried = 0, nr_skipped = 0; es_stats = &sbi->s_es_stats; start_time = ktime_get(); retry: spin_lock(&sbi->s_es_lock); nr_to_walk = sbi->s_es_nr_inode; while (nr_to_walk-- > 0) { if (list_empty(&sbi->s_es_list)) { spin_unlock(&sbi->s_es_lock); goto out; } ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info, i_es_list); /* Move the inode to the tail */ list_move_tail(&ei->i_es_list, &sbi->s_es_list); /* * Normally we try hard to avoid shrinking precached inodes, * but we will as a last resort. */ if (!retried && ext4_test_inode_state(&ei->vfs_inode, EXT4_STATE_EXT_PRECACHED)) { nr_skipped++; continue; } if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) { nr_skipped++; continue; } /* * Now we hold i_es_lock which protects us from inode reclaim * freeing inode under us */ spin_unlock(&sbi->s_es_lock); nr_shrunk += es_reclaim_extents(ei, &nr_to_scan); write_unlock(&ei->i_es_lock); if (nr_to_scan <= 0) goto out; spin_lock(&sbi->s_es_lock); } spin_unlock(&sbi->s_es_lock); /* * If we skipped any inodes, and we weren't able to make any * forward progress, try again to scan precached inodes. */ if ((nr_shrunk == 0) && nr_skipped && !retried) { retried++; goto retry; } if (locked_ei && nr_shrunk == 0) nr_shrunk = es_reclaim_extents(locked_ei, &nr_to_scan); out: scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time)); if (likely(es_stats->es_stats_scan_time)) es_stats->es_stats_scan_time = (scan_time + es_stats->es_stats_scan_time*3) / 4; else es_stats->es_stats_scan_time = scan_time; if (scan_time > es_stats->es_stats_max_scan_time) es_stats->es_stats_max_scan_time = scan_time; if (likely(es_stats->es_stats_shrunk)) es_stats->es_stats_shrunk = (nr_shrunk + es_stats->es_stats_shrunk*3) / 4; else es_stats->es_stats_shrunk = nr_shrunk; trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time, nr_skipped, retried); return nr_shrunk; } static unsigned long ext4_es_count(struct shrinker *shrink, struct shrink_control *sc) { unsigned long nr; struct ext4_sb_info *sbi; sbi = shrink->private_data; nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt); trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr); return nr; } static unsigned long ext4_es_scan(struct shrinker *shrink, struct shrink_control *sc) { struct ext4_sb_info *sbi = shrink->private_data; int nr_to_scan = sc->nr_to_scan; int ret, nr_shrunk; ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt); trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret); nr_shrunk = __es_shrink(sbi, nr_to_scan, NULL); ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt); trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret); return nr_shrunk; } int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v) { struct ext4_sb_info *sbi = EXT4_SB((struct super_block *) seq->private); struct ext4_es_stats *es_stats = &sbi->s_es_stats; struct ext4_inode_info *ei, *max = NULL; unsigned int inode_cnt = 0; if (v != SEQ_START_TOKEN) return 0; /* here we just find an inode that has the max nr. of objects */ spin_lock(&sbi->s_es_lock); list_for_each_entry(ei, &sbi->s_es_list, i_es_list) { inode_cnt++; if (max && max->i_es_all_nr < ei->i_es_all_nr) max = ei; else if (!max) max = ei; } spin_unlock(&sbi->s_es_lock); seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n", percpu_counter_sum_positive(&es_stats->es_stats_all_cnt), percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt)); seq_printf(seq, " %lld/%lld cache hits/misses\n", percpu_counter_sum_positive(&es_stats->es_stats_cache_hits), percpu_counter_sum_positive(&es_stats->es_stats_cache_misses)); if (inode_cnt) seq_printf(seq, " %d inodes on list\n", inode_cnt); seq_printf(seq, "average:\n %llu us scan time\n", div_u64(es_stats->es_stats_scan_time, 1000)); seq_printf(seq, " %lu shrunk objects\n", es_stats->es_stats_shrunk); if (inode_cnt) seq_printf(seq, "maximum:\n %lu inode (%u objects, %u reclaimable)\n" " %llu us max scan time\n", max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_shk_nr, div_u64(es_stats->es_stats_max_scan_time, 1000)); return 0; } int ext4_es_register_shrinker(struct ext4_sb_info *sbi) { int err; /* Make sure we have enough bits for physical block number */ BUILD_BUG_ON(ES_SHIFT < 48); INIT_LIST_HEAD(&sbi->s_es_list); sbi->s_es_nr_inode = 0; spin_lock_init(&sbi->s_es_lock); sbi->s_es_stats.es_stats_shrunk = 0; err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_hits, 0, GFP_KERNEL); if (err) return err; err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_misses, 0, GFP_KERNEL); if (err) goto err1; sbi->s_es_stats.es_stats_scan_time = 0; sbi->s_es_stats.es_stats_max_scan_time = 0; err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL); if (err) goto err2; err = percpu_counter_init(&sbi->s_es_stats.es_stats_shk_cnt, 0, GFP_KERNEL); if (err) goto err3; sbi->s_es_shrinker = shrinker_alloc(0, "ext4-es:%s", sbi->s_sb->s_id); if (!sbi->s_es_shrinker) { err = -ENOMEM; goto err4; } sbi->s_es_shrinker->scan_objects = ext4_es_scan; sbi->s_es_shrinker->count_objects = ext4_es_count; sbi->s_es_shrinker->private_data = sbi; shrinker_register(sbi->s_es_shrinker); return 0; err4: percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt); err3: percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt); err2: percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses); err1: percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits); return err; } void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi) { percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits); percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses); percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt); percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt); shrinker_free(sbi->s_es_shrinker); } /* * Shrink extents in given inode from ei->i_es_shrink_lblk till end. Scan at * most *nr_to_scan extents, update *nr_to_scan accordingly. * * Return 0 if we hit end of tree / interval, 1 if we exhausted nr_to_scan. * Increment *nr_shrunk by the number of reclaimed extents. Also update * ei->i_es_shrink_lblk to where we should continue scanning. */ static int es_do_reclaim_extents(struct ext4_inode_info *ei, ext4_lblk_t end, int *nr_to_scan, int *nr_shrunk) { struct inode *inode = &ei->vfs_inode; struct ext4_es_tree *tree = &ei->i_es_tree; struct extent_status *es; struct rb_node *node; es = __es_tree_search(&tree->root, ei->i_es_shrink_lblk); if (!es) goto out_wrap; while (*nr_to_scan > 0) { if (es->es_lblk > end) { ei->i_es_shrink_lblk = end + 1; return 0; } (*nr_to_scan)--; node = rb_next(&es->rb_node); if (ext4_es_must_keep(es)) goto next; if (ext4_es_is_referenced(es)) { ext4_es_clear_referenced(es); goto next; } rb_erase(&es->rb_node, &tree->root); ext4_es_free_extent(inode, es); (*nr_shrunk)++; next: if (!node) goto out_wrap; es = rb_entry(node, struct extent_status, rb_node); } ei->i_es_shrink_lblk = es->es_lblk; return 1; out_wrap: ei->i_es_shrink_lblk = 0; return 0; } static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan) { struct inode *inode = &ei->vfs_inode; int nr_shrunk = 0; ext4_lblk_t start = ei->i_es_shrink_lblk; static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); if (ei->i_es_shk_nr == 0) return 0; if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) && __ratelimit(&_rs)) ext4_warning(inode->i_sb, "forced shrink of precached extents"); if (!es_do_reclaim_extents(ei, EXT_MAX_BLOCKS, nr_to_scan, &nr_shrunk) && start != 0) es_do_reclaim_extents(ei, start - 1, nr_to_scan, &nr_shrunk); ei->i_es_tree.cache_es = NULL; return nr_shrunk; } /* * Called to support EXT4_IOC_CLEAR_ES_CACHE. We can only remove * discretionary entries from the extent status cache. (Some entries * must be present for proper operations.) */ void ext4_clear_inode_es(struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); struct extent_status *es; struct ext4_es_tree *tree; struct rb_node *node; write_lock(&ei->i_es_lock); tree = &EXT4_I(inode)->i_es_tree; tree->cache_es = NULL; node = rb_first(&tree->root); while (node) { es = rb_entry(node, struct extent_status, rb_node); node = rb_next(node); if (!ext4_es_must_keep(es)) { rb_erase(&es->rb_node, &tree->root); ext4_es_free_extent(inode, es); } } ext4_clear_inode_state(inode, EXT4_STATE_EXT_PRECACHED); write_unlock(&ei->i_es_lock); } #ifdef ES_DEBUG__ static void ext4_print_pending_tree(struct inode *inode) { struct ext4_pending_tree *tree; struct rb_node *node; struct pending_reservation *pr; printk(KERN_DEBUG "pending reservations for inode %lu:", inode->i_ino); tree = &EXT4_I(inode)->i_pending_tree; node = rb_first(&tree->root); while (node) { pr = rb_entry(node, struct pending_reservation, rb_node); printk(KERN_DEBUG " %u", pr->lclu); node = rb_next(node); } printk(KERN_DEBUG "\n"); } #else #define ext4_print_pending_tree(inode) #endif int __init ext4_init_pending(void) { ext4_pending_cachep = KMEM_CACHE(pending_reservation, SLAB_RECLAIM_ACCOUNT); if (ext4_pending_cachep == NULL) return -ENOMEM; return 0; } void ext4_exit_pending(void) { kmem_cache_destroy(ext4_pending_cachep); } void ext4_init_pending_tree(struct ext4_pending_tree *tree) { tree->root = RB_ROOT; } /* * __get_pending - retrieve a pointer to a pending reservation * * @inode - file containing the pending cluster reservation * @lclu - logical cluster of interest * * Returns a pointer to a pending reservation if it's a member of * the set, and NULL if not. Must be called holding i_es_lock. */ static struct pending_reservation *__get_pending(struct inode *inode, ext4_lblk_t lclu) { struct ext4_pending_tree *tree; struct rb_node *node; struct pending_reservation *pr = NULL; tree = &EXT4_I(inode)->i_pending_tree; node = (&tree->root)->rb_node; while (node) { pr = rb_entry(node, struct pending_reservation, rb_node); if (lclu < pr->lclu) node = node->rb_left; else if (lclu > pr->lclu) node = node->rb_right; else if (lclu == pr->lclu) return pr; } return NULL; } /* * __insert_pending - adds a pending cluster reservation to the set of * pending reservations * * @inode - file containing the cluster * @lblk - logical block in the cluster to be added * @prealloc - preallocated pending entry * * Returns 1 on successful insertion and -ENOMEM on failure. If the * pending reservation is already in the set, returns successfully. */ static int __insert_pending(struct inode *inode, ext4_lblk_t lblk, struct pending_reservation **prealloc) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct ext4_pending_tree *tree = &EXT4_I(inode)->i_pending_tree; struct rb_node **p = &tree->root.rb_node; struct rb_node *parent = NULL; struct pending_reservation *pr; ext4_lblk_t lclu; int ret = 0; lclu = EXT4_B2C(sbi, lblk); /* search to find parent for insertion */ while (*p) { parent = *p; pr = rb_entry(parent, struct pending_reservation, rb_node); if (lclu < pr->lclu) { p = &(*p)->rb_left; } else if (lclu > pr->lclu) { p = &(*p)->rb_right; } else { /* pending reservation already inserted */ goto out; } } if (likely(*prealloc == NULL)) { pr = __alloc_pending(false); if (!pr) { ret = -ENOMEM; goto out; } } else { pr = *prealloc; *prealloc = NULL; } pr->lclu = lclu; rb_link_node(&pr->rb_node, parent, p); rb_insert_color(&pr->rb_node, &tree->root); ret = 1; out: return ret; } /* * __remove_pending - removes a pending cluster reservation from the set * of pending reservations * * @inode - file containing the cluster * @lblk - logical block in the pending cluster reservation to be removed * * Returns successfully if pending reservation is not a member of the set. */ static void __remove_pending(struct inode *inode, ext4_lblk_t lblk) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct pending_reservation *pr; struct ext4_pending_tree *tree; pr = __get_pending(inode, EXT4_B2C(sbi, lblk)); if (pr != NULL) { tree = &EXT4_I(inode)->i_pending_tree; rb_erase(&pr->rb_node, &tree->root); __free_pending(pr); } } /* * ext4_remove_pending - removes a pending cluster reservation from the set * of pending reservations * * @inode - file containing the cluster * @lblk - logical block in the pending cluster reservation to be removed * * Locking for external use of __remove_pending. */ void ext4_remove_pending(struct inode *inode, ext4_lblk_t lblk) { struct ext4_inode_info *ei = EXT4_I(inode); write_lock(&ei->i_es_lock); __remove_pending(inode, lblk); write_unlock(&ei->i_es_lock); } /* * ext4_is_pending - determine whether a cluster has a pending reservation * on it * * @inode - file containing the cluster * @lblk - logical block in the cluster * * Returns true if there's a pending reservation for the cluster in the * set of pending reservations, and false if not. */ bool ext4_is_pending(struct inode *inode, ext4_lblk_t lblk) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct ext4_inode_info *ei = EXT4_I(inode); bool ret; read_lock(&ei->i_es_lock); ret = (bool)(__get_pending(inode, EXT4_B2C(sbi, lblk)) != NULL); read_unlock(&ei->i_es_lock); return ret; } /* * ext4_es_insert_delayed_extent - adds some delayed blocks to the extents * status tree, adding a pending reservation * where needed * * @inode - file containing the newly added block * @lblk - start logical block to be added * @len - length of blocks to be added * @lclu_allocated/end_allocated - indicates whether a physical cluster has * been allocated for the logical cluster * that contains the start/end block. Note that * end_allocated should always be set to false * if the start and the end block are in the * same cluster */ void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, bool lclu_allocated, bool end_allocated) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); struct extent_status newes; ext4_lblk_t end = lblk + len - 1; int err1 = 0, err2 = 0, err3 = 0; struct extent_status *es1 = NULL; struct extent_status *es2 = NULL; struct pending_reservation *pr1 = NULL; struct pending_reservation *pr2 = NULL; if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) return; es_debug("add [%u/%u) delayed to extent status tree of inode %lu\n", lblk, len, inode->i_ino); if (!len) return; WARN_ON_ONCE((EXT4_B2C(sbi, lblk) == EXT4_B2C(sbi, end)) && end_allocated); newes.es_lblk = lblk; newes.es_len = len; ext4_es_store_pblock_status(&newes, ~0, EXTENT_STATUS_DELAYED); trace_ext4_es_insert_delayed_extent(inode, &newes, lclu_allocated, end_allocated); ext4_es_insert_extent_check(inode, &newes); retry: if (err1 && !es1) es1 = __es_alloc_extent(true); if ((err1 || err2) && !es2) es2 = __es_alloc_extent(true); if (err1 || err2 || err3 < 0) { if (lclu_allocated && !pr1) pr1 = __alloc_pending(true); if (end_allocated && !pr2) pr2 = __alloc_pending(true); } write_lock(&EXT4_I(inode)->i_es_lock); err1 = __es_remove_extent(inode, lblk, end, NULL, es1); if (err1 != 0) goto error; /* Free preallocated extent if it didn't get used. */ if (es1) { if (!es1->es_len) __es_free_extent(es1); es1 = NULL; } err2 = __es_insert_extent(inode, &newes, es2); if (err2 != 0) goto error; /* Free preallocated extent if it didn't get used. */ if (es2) { if (!es2->es_len) __es_free_extent(es2); es2 = NULL; } if (lclu_allocated) { err3 = __insert_pending(inode, lblk, &pr1); if (err3 < 0) goto error; if (pr1) { __free_pending(pr1); pr1 = NULL; } } if (end_allocated) { err3 = __insert_pending(inode, end, &pr2); if (err3 < 0) goto error; if (pr2) { __free_pending(pr2); pr2 = NULL; } } error: write_unlock(&EXT4_I(inode)->i_es_lock); if (err1 || err2 || err3 < 0) goto retry; ext4_es_print_tree(inode); ext4_print_pending_tree(inode); return; } /* * __revise_pending - makes, cancels, or leaves unchanged pending cluster * reservations for a specified block range depending * upon the presence or absence of delayed blocks * outside the range within clusters at the ends of the * range * * @inode - file containing the range * @lblk - logical block defining the start of range * @len - length of range in blocks * @prealloc - preallocated pending entry * * Used after a newly allocated extent is added to the extents status tree. * Requires that the extents in the range have either written or unwritten * status. Must be called while holding i_es_lock. Returns number of new * inserts pending cluster on insert pendings, returns 0 on remove pendings, * return -ENOMEM on failure. */ static int __revise_pending(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, struct pending_reservation **prealloc) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); ext4_lblk_t end = lblk + len - 1; ext4_lblk_t first, last; bool f_del = false, l_del = false; int pendings = 0; int ret = 0; if (len == 0) return 0; /* * Two cases - block range within single cluster and block range * spanning two or more clusters. Note that a cluster belonging * to a range starting and/or ending on a cluster boundary is treated * as if it does not contain a delayed extent. The new range may * have allocated space for previously delayed blocks out to the * cluster boundary, requiring that any pre-existing pending * reservation be canceled. Because this code only looks at blocks * outside the range, it should revise pending reservations * correctly even if the extent represented by the range can't be * inserted in the extents status tree due to ENOSPC. */ if (EXT4_B2C(sbi, lblk) == EXT4_B2C(sbi, end)) { first = EXT4_LBLK_CMASK(sbi, lblk); if (first != lblk) f_del = __es_scan_range(inode, &ext4_es_is_delayed, first, lblk - 1); if (f_del) { ret = __insert_pending(inode, first, prealloc); if (ret < 0) goto out; pendings += ret; } else { last = EXT4_LBLK_CMASK(sbi, end) + sbi->s_cluster_ratio - 1; if (last != end) l_del = __es_scan_range(inode, &ext4_es_is_delayed, end + 1, last); if (l_del) { ret = __insert_pending(inode, last, prealloc); if (ret < 0) goto out; pendings += ret; } else __remove_pending(inode, last); } } else { first = EXT4_LBLK_CMASK(sbi, lblk); if (first != lblk) f_del = __es_scan_range(inode, &ext4_es_is_delayed, first, lblk - 1); if (f_del) { ret = __insert_pending(inode, first, prealloc); if (ret < 0) goto out; pendings += ret; } else __remove_pending(inode, first); last = EXT4_LBLK_CMASK(sbi, end) + sbi->s_cluster_ratio - 1; if (last != end) l_del = __es_scan_range(inode, &ext4_es_is_delayed, end + 1, last); if (l_del) { ret = __insert_pending(inode, last, prealloc); if (ret < 0) goto out; pendings += ret; } else __remove_pending(inode, last); } out: return (ret < 0) ? ret : pendings; } |
| 49 49 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 | // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB /* * Copyright (c) 2017 Mellanox Technologies Ltd. All rights reserved. */ #include "rxe.h" #include "rxe_hw_counters.h" static const struct rdma_stat_desc rxe_counter_descs[] = { [RXE_CNT_SENT_PKTS].name = "sent_pkts", [RXE_CNT_RCVD_PKTS].name = "rcvd_pkts", [RXE_CNT_DUP_REQ].name = "duplicate_request", [RXE_CNT_OUT_OF_SEQ_REQ].name = "out_of_seq_request", [RXE_CNT_RCV_RNR].name = "rcvd_rnr_err", [RXE_CNT_SND_RNR].name = "send_rnr_err", [RXE_CNT_RCV_SEQ_ERR].name = "rcvd_seq_err", [RXE_CNT_SENDER_SCHED].name = "ack_deferred", [RXE_CNT_RETRY_EXCEEDED].name = "retry_exceeded_err", [RXE_CNT_RNR_RETRY_EXCEEDED].name = "retry_rnr_exceeded_err", [RXE_CNT_COMP_RETRY].name = "completer_retry_err", [RXE_CNT_SEND_ERR].name = "send_err", [RXE_CNT_LINK_DOWNED].name = "link_downed", [RXE_CNT_RDMA_SEND].name = "rdma_sends", [RXE_CNT_RDMA_RECV].name = "rdma_recvs", }; int rxe_ib_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats, u32 port, int index) { struct rxe_dev *dev = to_rdev(ibdev); unsigned int cnt; if (!port || !stats) return -EINVAL; for (cnt = 0; cnt < ARRAY_SIZE(rxe_counter_descs); cnt++) stats->value[cnt] = atomic64_read(&dev->stats_counters[cnt]); return ARRAY_SIZE(rxe_counter_descs); } struct rdma_hw_stats *rxe_ib_alloc_hw_port_stats(struct ib_device *ibdev, u32 port_num) { BUILD_BUG_ON(ARRAY_SIZE(rxe_counter_descs) != RXE_NUM_OF_COUNTERS); return rdma_alloc_hw_stats_struct(rxe_counter_descs, ARRAY_SIZE(rxe_counter_descs), RDMA_HW_STATS_DEFAULT_LIFESPAN); } |
| 82 82 86 86 102 102 108 93 87 82 82 214 214 3 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ #include "peerlookup.h" #include "peer.h" #include "noise.h" static struct hlist_head *pubkey_bucket(struct pubkey_hashtable *table, const u8 pubkey[NOISE_PUBLIC_KEY_LEN]) { /* siphash gives us a secure 64bit number based on a random key. Since * the bits are uniformly distributed, we can then mask off to get the * bits we need. */ const u64 hash = siphash(pubkey, NOISE_PUBLIC_KEY_LEN, &table->key); return &table->hashtable[hash & (HASH_SIZE(table->hashtable) - 1)]; } struct pubkey_hashtable *wg_pubkey_hashtable_alloc(void) { struct pubkey_hashtable *table = kvmalloc(sizeof(*table), GFP_KERNEL); if (!table) return NULL; get_random_bytes(&table->key, sizeof(table->key)); hash_init(table->hashtable); mutex_init(&table->lock); return table; } void wg_pubkey_hashtable_add(struct pubkey_hashtable *table, struct wg_peer *peer) { mutex_lock(&table->lock); hlist_add_head_rcu(&peer->pubkey_hash, pubkey_bucket(table, peer->handshake.remote_static)); mutex_unlock(&table->lock); } void wg_pubkey_hashtable_remove(struct pubkey_hashtable *table, struct wg_peer *peer) { mutex_lock(&table->lock); hlist_del_init_rcu(&peer->pubkey_hash); mutex_unlock(&table->lock); } /* Returns a strong reference to a peer */ struct wg_peer * wg_pubkey_hashtable_lookup(struct pubkey_hashtable *table, const u8 pubkey[NOISE_PUBLIC_KEY_LEN]) { struct wg_peer *iter_peer, *peer = NULL; rcu_read_lock_bh(); hlist_for_each_entry_rcu_bh(iter_peer, pubkey_bucket(table, pubkey), pubkey_hash) { if (!memcmp(pubkey, iter_peer->handshake.remote_static, NOISE_PUBLIC_KEY_LEN)) { peer = iter_peer; break; } } peer = wg_peer_get_maybe_zero(peer); rcu_read_unlock_bh(); return peer; } static struct hlist_head *index_bucket(struct index_hashtable *table, const __le32 index) { /* Since the indices are random and thus all bits are uniformly * distributed, we can find its bucket simply by masking. */ return &table->hashtable[(__force u32)index & (HASH_SIZE(table->hashtable) - 1)]; } struct index_hashtable *wg_index_hashtable_alloc(void) { struct index_hashtable *table = kvmalloc(sizeof(*table), GFP_KERNEL); if (!table) return NULL; hash_init(table->hashtable); spin_lock_init(&table->lock); return table; } /* At the moment, we limit ourselves to 2^20 total peers, which generally might * amount to 2^20*3 items in this hashtable. The algorithm below works by * picking a random number and testing it. We can see that these limits mean we * usually succeed pretty quickly: * * >>> def calculation(tries, size): * ... return (size / 2**32)**(tries - 1) * (1 - (size / 2**32)) * ... * >>> calculation(1, 2**20 * 3) * 0.999267578125 * >>> calculation(2, 2**20 * 3) * 0.0007318854331970215 * >>> calculation(3, 2**20 * 3) * 5.360489012673497e-07 * >>> calculation(4, 2**20 * 3) * 3.9261394135792216e-10 * * At the moment, we don't do any masking, so this algorithm isn't exactly * constant time in either the random guessing or in the hash list lookup. We * could require a minimum of 3 tries, which would successfully mask the * guessing. this would not, however, help with the growing hash lengths, which * is another thing to consider moving forward. */ __le32 wg_index_hashtable_insert(struct index_hashtable *table, struct index_hashtable_entry *entry) { struct index_hashtable_entry *existing_entry; spin_lock_bh(&table->lock); hlist_del_init_rcu(&entry->index_hash); spin_unlock_bh(&table->lock); rcu_read_lock_bh(); search_unused_slot: /* First we try to find an unused slot, randomly, while unlocked. */ entry->index = (__force __le32)get_random_u32(); hlist_for_each_entry_rcu_bh(existing_entry, index_bucket(table, entry->index), index_hash) { if (existing_entry->index == entry->index) /* If it's already in use, we continue searching. */ goto search_unused_slot; } /* Once we've found an unused slot, we lock it, and then double-check * that nobody else stole it from us. */ spin_lock_bh(&table->lock); hlist_for_each_entry_rcu_bh(existing_entry, index_bucket(table, entry->index), index_hash) { if (existing_entry->index == entry->index) { spin_unlock_bh(&table->lock); /* If it was stolen, we start over. */ goto search_unused_slot; } } /* Otherwise, we know we have it exclusively (since we're locked), * so we insert. */ hlist_add_head_rcu(&entry->index_hash, index_bucket(table, entry->index)); spin_unlock_bh(&table->lock); rcu_read_unlock_bh(); return entry->index; } bool wg_index_hashtable_replace(struct index_hashtable *table, struct index_hashtable_entry *old, struct index_hashtable_entry *new) { bool ret; spin_lock_bh(&table->lock); ret = !hlist_unhashed(&old->index_hash); if (unlikely(!ret)) goto out; new->index = old->index; hlist_replace_rcu(&old->index_hash, &new->index_hash); /* Calling init here NULLs out index_hash, and in fact after this * function returns, it's theoretically possible for this to get * reinserted elsewhere. That means the RCU lookup below might either * terminate early or jump between buckets, in which case the packet * simply gets dropped, which isn't terrible. */ INIT_HLIST_NODE(&old->index_hash); out: spin_unlock_bh(&table->lock); return ret; } void wg_index_hashtable_remove(struct index_hashtable *table, struct index_hashtable_entry *entry) { spin_lock_bh(&table->lock); hlist_del_init_rcu(&entry->index_hash); spin_unlock_bh(&table->lock); } /* Returns a strong reference to a entry->peer */ struct index_hashtable_entry * wg_index_hashtable_lookup(struct index_hashtable *table, const enum index_hashtable_type type_mask, const __le32 index, struct wg_peer **peer) { struct index_hashtable_entry *iter_entry, *entry = NULL; rcu_read_lock_bh(); hlist_for_each_entry_rcu_bh(iter_entry, index_bucket(table, index), index_hash) { if (iter_entry->index == index) { if (likely(iter_entry->type & type_mask)) entry = iter_entry; break; } } if (likely(entry)) { entry->peer = wg_peer_get_maybe_zero(entry->peer); if (likely(entry->peer)) *peer = entry->peer; else entry = NULL; } rcu_read_unlock_bh(); return entry; } |
| 35 36 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | // SPDX-License-Identifier: GPL-2.0-or-later /* Function to determine if a thread group is single threaded or not * * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) * - Derived from security/selinux/hooks.c */ #include <linux/sched/signal.h> #include <linux/sched/task.h> #include <linux/sched/mm.h> /* * Returns true if the task does not share ->mm with another thread/process. */ bool current_is_single_threaded(void) { struct task_struct *task = current; struct mm_struct *mm = task->mm; struct task_struct *p, *t; bool ret; if (atomic_read(&task->signal->live) != 1) return false; if (atomic_read(&mm->mm_users) == 1) return true; ret = false; rcu_read_lock(); for_each_process(p) { if (unlikely(p->flags & PF_KTHREAD)) continue; if (unlikely(p == task->group_leader)) continue; for_each_thread(p, t) { if (unlikely(t->mm == mm)) goto found; if (likely(t->mm)) break; /* * t->mm == NULL. Make sure next_thread/next_task * will see other CLONE_VM tasks which might be * forked before exiting. */ smp_rmb(); } } ret = true; found: rcu_read_unlock(); return ret; } |
| 9 4 1 4 1 2 2 1 3 1 2 5 4 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 | // SPDX-License-Identifier: GPL-2.0-only /* * 32bit Socket syscall emulation. Based on arch/sparc64/kernel/sys_sparc32.c. * * Copyright (C) 2000 VA Linux Co * Copyright (C) 2000 Don Dugger <n0ano@valinux.com> * Copyright (C) 1999 Arun Sharma <arun.sharma@intel.com> * Copyright (C) 1997,1998 Jakub Jelinek (jj@sunsite.mff.cuni.cz) * Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu) * Copyright (C) 2000 Hewlett-Packard Co. * Copyright (C) 2000 David Mosberger-Tang <davidm@hpl.hp.com> * Copyright (C) 2000,2001 Andi Kleen, SuSE Labs */ #include <linux/kernel.h> #include <linux/gfp.h> #include <linux/fs.h> #include <linux/types.h> #include <linux/file.h> #include <linux/icmpv6.h> #include <linux/socket.h> #include <linux/syscalls.h> #include <linux/filter.h> #include <linux/compat.h> #include <linux/security.h> #include <linux/audit.h> #include <linux/export.h> #include <net/scm.h> #include <net/sock.h> #include <net/ip.h> #include <net/ipv6.h> #include <linux/uaccess.h> #include <net/compat.h> int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr *msg, struct sockaddr __user **save_addr) { ssize_t err; kmsg->msg_flags = msg->msg_flags; kmsg->msg_namelen = msg->msg_namelen; if (!msg->msg_name) kmsg->msg_namelen = 0; if (kmsg->msg_namelen < 0) return -EINVAL; if (kmsg->msg_namelen > sizeof(struct sockaddr_storage)) kmsg->msg_namelen = sizeof(struct sockaddr_storage); kmsg->msg_control_is_user = true; kmsg->msg_get_inq = 0; kmsg->msg_control_user = compat_ptr(msg->msg_control); kmsg->msg_controllen = msg->msg_controllen; if (save_addr) *save_addr = compat_ptr(msg->msg_name); if (msg->msg_name && kmsg->msg_namelen) { if (!save_addr) { err = move_addr_to_kernel(compat_ptr(msg->msg_name), kmsg->msg_namelen, kmsg->msg_name); if (err < 0) return err; } } else { kmsg->msg_name = NULL; kmsg->msg_namelen = 0; } if (msg->msg_iovlen > UIO_MAXIOV) return -EMSGSIZE; kmsg->msg_iocb = NULL; kmsg->msg_ubuf = NULL; return 0; } int get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg, struct sockaddr __user **save_addr, struct iovec **iov) { struct compat_msghdr msg; ssize_t err; if (copy_from_user(&msg, umsg, sizeof(*umsg))) return -EFAULT; err = __get_compat_msghdr(kmsg, &msg, save_addr); if (err) return err; err = import_iovec(save_addr ? ITER_DEST : ITER_SOURCE, compat_ptr(msg.msg_iov), msg.msg_iovlen, UIO_FASTIOV, iov, &kmsg->msg_iter); return err < 0 ? err : 0; } /* Bleech... */ #define CMSG_COMPAT_ALIGN(len) ALIGN((len), sizeof(s32)) #define CMSG_COMPAT_DATA(cmsg) \ ((void __user *)((char __user *)(cmsg) + sizeof(struct compat_cmsghdr))) #define CMSG_COMPAT_SPACE(len) \ (sizeof(struct compat_cmsghdr) + CMSG_COMPAT_ALIGN(len)) #define CMSG_COMPAT_LEN(len) \ (sizeof(struct compat_cmsghdr) + (len)) #define CMSG_COMPAT_FIRSTHDR(msg) \ (((msg)->msg_controllen) >= sizeof(struct compat_cmsghdr) ? \ (struct compat_cmsghdr __user *)((msg)->msg_control_user) : \ (struct compat_cmsghdr __user *)NULL) #define CMSG_COMPAT_OK(ucmlen, ucmsg, mhdr) \ ((ucmlen) >= sizeof(struct compat_cmsghdr) && \ (ucmlen) <= (unsigned long) \ ((mhdr)->msg_controllen - \ ((char __user *)(ucmsg) - (char __user *)(mhdr)->msg_control_user))) static inline struct compat_cmsghdr __user *cmsg_compat_nxthdr(struct msghdr *msg, struct compat_cmsghdr __user *cmsg, int cmsg_len) { char __user *ptr = (char __user *)cmsg + CMSG_COMPAT_ALIGN(cmsg_len); if ((unsigned long)(ptr + 1 - (char __user *)msg->msg_control_user) > msg->msg_controllen) return NULL; return (struct compat_cmsghdr __user *)ptr; } /* There is a lot of hair here because the alignment rules (and * thus placement) of cmsg headers and length are different for * 32-bit apps. -DaveM */ int cmsghdr_from_user_compat_to_kern(struct msghdr *kmsg, struct sock *sk, unsigned char *stackbuf, int stackbuf_size) { struct compat_cmsghdr __user *ucmsg; struct cmsghdr *kcmsg, *kcmsg_base; compat_size_t ucmlen; __kernel_size_t kcmlen, tmp; int err = -EFAULT; BUILD_BUG_ON(sizeof(struct compat_cmsghdr) != CMSG_COMPAT_ALIGN(sizeof(struct compat_cmsghdr))); kcmlen = 0; kcmsg_base = kcmsg = (struct cmsghdr *)stackbuf; ucmsg = CMSG_COMPAT_FIRSTHDR(kmsg); while (ucmsg != NULL) { if (get_user(ucmlen, &ucmsg->cmsg_len)) return -EFAULT; /* Catch bogons. */ if (!CMSG_COMPAT_OK(ucmlen, ucmsg, kmsg)) return -EINVAL; tmp = ((ucmlen - sizeof(*ucmsg)) + sizeof(struct cmsghdr)); tmp = CMSG_ALIGN(tmp); kcmlen += tmp; ucmsg = cmsg_compat_nxthdr(kmsg, ucmsg, ucmlen); } if (kcmlen == 0) return -EINVAL; /* The kcmlen holds the 64-bit version of the control length. * It may not be modified as we do not stick it into the kmsg * until we have successfully copied over all of the data * from the user. */ if (kcmlen > stackbuf_size) kcmsg_base = kcmsg = sock_kmalloc(sk, kcmlen, GFP_KERNEL); if (kcmsg == NULL) return -ENOMEM; /* Now copy them over neatly. */ memset(kcmsg, 0, kcmlen); ucmsg = CMSG_COMPAT_FIRSTHDR(kmsg); while (ucmsg != NULL) { struct compat_cmsghdr cmsg; if (copy_from_user(&cmsg, ucmsg, sizeof(cmsg))) goto Efault; if (!CMSG_COMPAT_OK(cmsg.cmsg_len, ucmsg, kmsg)) goto Einval; tmp = ((cmsg.cmsg_len - sizeof(*ucmsg)) + sizeof(struct cmsghdr)); if ((char *)kcmsg_base + kcmlen - (char *)kcmsg < CMSG_ALIGN(tmp)) goto Einval; kcmsg->cmsg_len = tmp; kcmsg->cmsg_level = cmsg.cmsg_level; kcmsg->cmsg_type = cmsg.cmsg_type; tmp = CMSG_ALIGN(tmp); if (copy_from_user(CMSG_DATA(kcmsg), CMSG_COMPAT_DATA(ucmsg), (cmsg.cmsg_len - sizeof(*ucmsg)))) goto Efault; /* Advance. */ kcmsg = (struct cmsghdr *)((char *)kcmsg + tmp); ucmsg = cmsg_compat_nxthdr(kmsg, ucmsg, cmsg.cmsg_len); } /* * check the length of messages copied in is the same as the * what we get from the first loop */ if ((char *)kcmsg - (char *)kcmsg_base != kcmlen) goto Einval; /* Ok, looks like we made it. Hook it up and return success. */ kmsg->msg_control_is_user = false; kmsg->msg_control = kcmsg_base; kmsg->msg_controllen = kcmlen; return 0; Einval: err = -EINVAL; Efault: if (kcmsg_base != (struct cmsghdr *)stackbuf) sock_kfree_s(sk, kcmsg_base, kcmlen); return err; } int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *data) { struct compat_cmsghdr __user *cm = (struct compat_cmsghdr __user *) kmsg->msg_control_user; struct compat_cmsghdr cmhdr; struct old_timeval32 ctv; struct old_timespec32 cts[3]; int cmlen; if (cm == NULL || kmsg->msg_controllen < sizeof(*cm)) { kmsg->msg_flags |= MSG_CTRUNC; return 0; /* XXX: return error? check spec. */ } if (!COMPAT_USE_64BIT_TIME) { if (level == SOL_SOCKET && type == SO_TIMESTAMP_OLD) { struct __kernel_old_timeval *tv = (struct __kernel_old_timeval *)data; ctv.tv_sec = tv->tv_sec; ctv.tv_usec = tv->tv_usec; data = &ctv; len = sizeof(ctv); } if (level == SOL_SOCKET && (type == SO_TIMESTAMPNS_OLD || type == SO_TIMESTAMPING_OLD)) { int count = type == SO_TIMESTAMPNS_OLD ? 1 : 3; int i; struct __kernel_old_timespec *ts = data; for (i = 0; i < count; i++) { cts[i].tv_sec = ts[i].tv_sec; cts[i].tv_nsec = ts[i].tv_nsec; } data = &cts; len = sizeof(cts[0]) * count; } } cmlen = CMSG_COMPAT_LEN(len); if (kmsg->msg_controllen < cmlen) { kmsg->msg_flags |= MSG_CTRUNC; cmlen = kmsg->msg_controllen; } cmhdr.cmsg_level = level; cmhdr.cmsg_type = type; cmhdr.cmsg_len = cmlen; if (copy_to_user(cm, &cmhdr, sizeof cmhdr)) return -EFAULT; if (copy_to_user(CMSG_COMPAT_DATA(cm), data, cmlen - sizeof(struct compat_cmsghdr))) return -EFAULT; cmlen = CMSG_COMPAT_SPACE(len); if (kmsg->msg_controllen < cmlen) cmlen = kmsg->msg_controllen; kmsg->msg_control_user += cmlen; kmsg->msg_controllen -= cmlen; return 0; } static int scm_max_fds_compat(struct msghdr *msg) { if (msg->msg_controllen <= sizeof(struct compat_cmsghdr)) return 0; return (msg->msg_controllen - sizeof(struct compat_cmsghdr)) / sizeof(int); } void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm) { struct compat_cmsghdr __user *cm = (struct compat_cmsghdr __user *)msg->msg_control_user; unsigned int o_flags = (msg->msg_flags & MSG_CMSG_CLOEXEC) ? O_CLOEXEC : 0; int fdmax = min_t(int, scm_max_fds_compat(msg), scm->fp->count); int __user *cmsg_data = CMSG_COMPAT_DATA(cm); int err = 0, i; for (i = 0; i < fdmax; i++) { err = scm_recv_one_fd(scm->fp->fp[i], cmsg_data + i, o_flags); if (err < 0) break; } if (i > 0) { int cmlen = CMSG_COMPAT_LEN(i * sizeof(int)); err = put_user(SOL_SOCKET, &cm->cmsg_level); if (!err) err = put_user(SCM_RIGHTS, &cm->cmsg_type); if (!err) err = put_user(cmlen, &cm->cmsg_len); if (!err) { cmlen = CMSG_COMPAT_SPACE(i * sizeof(int)); if (msg->msg_controllen < cmlen) cmlen = msg->msg_controllen; msg->msg_control_user += cmlen; msg->msg_controllen -= cmlen; } } if (i < scm->fp->count || (scm->fp->count && fdmax <= 0)) msg->msg_flags |= MSG_CTRUNC; /* * All of the files that fit in the message have had their usage counts * incremented, so we just free the list. */ __scm_destroy(scm); } /* Argument list sizes for compat_sys_socketcall */ #define AL(x) ((x) * sizeof(u32)) static unsigned char nas[21] = { AL(0), AL(3), AL(3), AL(3), AL(2), AL(3), AL(3), AL(3), AL(4), AL(4), AL(4), AL(6), AL(6), AL(2), AL(5), AL(5), AL(3), AL(3), AL(4), AL(5), AL(4) }; #undef AL static inline long __compat_sys_sendmsg(int fd, struct compat_msghdr __user *msg, unsigned int flags) { return __sys_sendmsg(fd, (struct user_msghdr __user *)msg, flags | MSG_CMSG_COMPAT, false); } COMPAT_SYSCALL_DEFINE3(sendmsg, int, fd, struct compat_msghdr __user *, msg, unsigned int, flags) { return __compat_sys_sendmsg(fd, msg, flags); } static inline long __compat_sys_sendmmsg(int fd, struct compat_mmsghdr __user *mmsg, unsigned int vlen, unsigned int flags) { return __sys_sendmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, flags | MSG_CMSG_COMPAT, false); } COMPAT_SYSCALL_DEFINE4(sendmmsg, int, fd, struct compat_mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags) { return __compat_sys_sendmmsg(fd, mmsg, vlen, flags); } static inline long __compat_sys_recvmsg(int fd, struct compat_msghdr __user *msg, unsigned int flags) { return __sys_recvmsg(fd, (struct user_msghdr __user *)msg, flags | MSG_CMSG_COMPAT, false); } COMPAT_SYSCALL_DEFINE3(recvmsg, int, fd, struct compat_msghdr __user *, msg, unsigned int, flags) { return __compat_sys_recvmsg(fd, msg, flags); } static inline long __compat_sys_recvfrom(int fd, void __user *buf, compat_size_t len, unsigned int flags, struct sockaddr __user *addr, int __user *addrlen) { return __sys_recvfrom(fd, buf, len, flags | MSG_CMSG_COMPAT, addr, addrlen); } COMPAT_SYSCALL_DEFINE4(recv, int, fd, void __user *, buf, compat_size_t, len, unsigned int, flags) { return __compat_sys_recvfrom(fd, buf, len, flags, NULL, NULL); } COMPAT_SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, buf, compat_size_t, len, unsigned int, flags, struct sockaddr __user *, addr, int __user *, addrlen) { return __compat_sys_recvfrom(fd, buf, len, flags, addr, addrlen); } COMPAT_SYSCALL_DEFINE5(recvmmsg_time64, int, fd, struct compat_mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct __kernel_timespec __user *, timeout) { return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, flags | MSG_CMSG_COMPAT, timeout, NULL); } #ifdef CONFIG_COMPAT_32BIT_TIME COMPAT_SYSCALL_DEFINE5(recvmmsg_time32, int, fd, struct compat_mmsghdr __user *, mmsg, unsigned int, vlen, unsigned int, flags, struct old_timespec32 __user *, timeout) { return __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen, flags | MSG_CMSG_COMPAT, NULL, timeout); } #endif COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args) { u32 a[AUDITSC_ARGS]; unsigned int len; u32 a0, a1; int ret; if (call < SYS_SOCKET || call > SYS_SENDMMSG) return -EINVAL; len = nas[call]; if (len > sizeof(a)) return -EINVAL; if (copy_from_user(a, args, len)) return -EFAULT; ret = audit_socketcall_compat(len / sizeof(a[0]), a); if (ret) return ret; a0 = a[0]; a1 = a[1]; switch (call) { case SYS_SOCKET: ret = __sys_socket(a0, a1, a[2]); break; case SYS_BIND: ret = __sys_bind(a0, compat_ptr(a1), a[2]); break; case SYS_CONNECT: ret = __sys_connect(a0, compat_ptr(a1), a[2]); break; case SYS_LISTEN: ret = __sys_listen(a0, a1); break; case SYS_ACCEPT: ret = __sys_accept4(a0, compat_ptr(a1), compat_ptr(a[2]), 0); break; case SYS_GETSOCKNAME: ret = __sys_getsockname(a0, compat_ptr(a1), compat_ptr(a[2])); break; case SYS_GETPEERNAME: ret = __sys_getpeername(a0, compat_ptr(a1), compat_ptr(a[2])); break; case SYS_SOCKETPAIR: ret = __sys_socketpair(a0, a1, a[2], compat_ptr(a[3])); break; case SYS_SEND: ret = __sys_sendto(a0, compat_ptr(a1), a[2], a[3], NULL, 0); break; case SYS_SENDTO: ret = __sys_sendto(a0, compat_ptr(a1), a[2], a[3], compat_ptr(a[4]), a[5]); break; case SYS_RECV: ret = __compat_sys_recvfrom(a0, compat_ptr(a1), a[2], a[3], NULL, NULL); break; case SYS_RECVFROM: ret = __compat_sys_recvfrom(a0, compat_ptr(a1), a[2], a[3], compat_ptr(a[4]), compat_ptr(a[5])); break; case SYS_SHUTDOWN: ret = __sys_shutdown(a0, a1); break; case SYS_SETSOCKOPT: ret = __sys_setsockopt(a0, a1, a[2], compat_ptr(a[3]), a[4]); break; case SYS_GETSOCKOPT: ret = __sys_getsockopt(a0, a1, a[2], compat_ptr(a[3]), compat_ptr(a[4])); break; case SYS_SENDMSG: ret = __compat_sys_sendmsg(a0, compat_ptr(a1), a[2]); break; case SYS_SENDMMSG: ret = __compat_sys_sendmmsg(a0, compat_ptr(a1), a[2], a[3]); break; case SYS_RECVMSG: ret = __compat_sys_recvmsg(a0, compat_ptr(a1), a[2]); break; case SYS_RECVMMSG: ret = __sys_recvmmsg(a0, compat_ptr(a1), a[2], a[3] | MSG_CMSG_COMPAT, NULL, compat_ptr(a[4])); break; case SYS_ACCEPT4: ret = __sys_accept4(a0, compat_ptr(a1), compat_ptr(a[2]), a[3]); break; default: ret = -EINVAL; break; } return ret; } |
| 13 13 13 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 | /* * linux/include/video/vga.h -- standard VGA chipset interaction * * Copyright 1999 Jeff Garzik <jgarzik@pobox.com> * * Copyright history from vga16fb.c: * Copyright 1999 Ben Pfaff and Petr Vandrovec * Based on VGA info at http://www.osdever.net/FreeVGA/home.htm * Based on VESA framebuffer (c) 1998 Gerd Knorr * * This file is subject to the terms and conditions of the GNU General * Public License. See the file COPYING in the main directory of this * archive for more details. * */ #ifndef __linux_video_vga_h__ #define __linux_video_vga_h__ #include <linux/types.h> #include <linux/io.h> #include <asm/vga.h> #include <asm/byteorder.h> #define VGA_FB_PHYS_BASE 0xA0000 /* VGA framebuffer I/O base */ #define VGA_FB_PHYS_SIZE 65536 /* VGA framebuffer I/O size */ /* Some of the code below is taken from SVGAlib. The original, unmodified copyright notice for that code is below. */ /* VGAlib version 1.2 - (c) 1993 Tommy Frandsen */ /* */ /* This library is free software; you can redistribute it and/or */ /* modify it without any restrictions. This library is distributed */ /* in the hope that it will be useful, but without any warranty. */ /* Multi-chipset support Copyright 1993 Harm Hanemaayer */ /* partially copyrighted (C) 1993 by Hartmut Schirmer */ /* VGA data register ports */ #define VGA_CRT_DC 0x3D5 /* CRT Controller Data Register - color emulation */ #define VGA_CRT_DM 0x3B5 /* CRT Controller Data Register - mono emulation */ #define VGA_ATT_R 0x3C1 /* Attribute Controller Data Read Register */ #define VGA_ATT_W 0x3C0 /* Attribute Controller Data Write Register */ #define VGA_GFX_D 0x3CF /* Graphics Controller Data Register */ #define VGA_SEQ_D 0x3C5 /* Sequencer Data Register */ #define VGA_MIS_R 0x3CC /* Misc Output Read Register */ #define VGA_MIS_W 0x3C2 /* Misc Output Write Register */ #define VGA_FTC_R 0x3CA /* Feature Control Read Register */ #define VGA_IS1_RC 0x3DA /* Input Status Register 1 - color emulation */ #define VGA_IS1_RM 0x3BA /* Input Status Register 1 - mono emulation */ #define VGA_PEL_D 0x3C9 /* PEL Data Register */ #define VGA_PEL_MSK 0x3C6 /* PEL mask register */ /* EGA-specific registers */ #define EGA_GFX_E0 0x3CC /* Graphics enable processor 0 */ #define EGA_GFX_E1 0x3CA /* Graphics enable processor 1 */ /* VGA index register ports */ #define VGA_CRT_IC 0x3D4 /* CRT Controller Index - color emulation */ #define VGA_CRT_IM 0x3B4 /* CRT Controller Index - mono emulation */ #define VGA_ATT_IW 0x3C0 /* Attribute Controller Index & Data Write Register */ #define VGA_GFX_I 0x3CE /* Graphics Controller Index */ #define VGA_SEQ_I 0x3C4 /* Sequencer Index */ #define VGA_PEL_IW 0x3C8 /* PEL Write Index */ #define VGA_PEL_IR 0x3C7 /* PEL Read Index */ /* standard VGA indexes max counts */ #define VGA_CRT_C 0x19 /* Number of CRT Controller Registers */ #define VGA_ATT_C 0x15 /* Number of Attribute Controller Registers */ #define VGA_GFX_C 0x09 /* Number of Graphics Controller Registers */ #define VGA_SEQ_C 0x05 /* Number of Sequencer Registers */ #define VGA_MIS_C 0x01 /* Number of Misc Output Register */ /* VGA misc register bit masks */ #define VGA_MIS_COLOR 0x01 #define VGA_MIS_ENB_MEM_ACCESS 0x02 #define VGA_MIS_DCLK_28322_720 0x04 #define VGA_MIS_ENB_PLL_LOAD (0x04 | 0x08) #define VGA_MIS_SEL_HIGH_PAGE 0x20 /* VGA CRT controller register indices */ #define VGA_CRTC_H_TOTAL 0 #define VGA_CRTC_H_DISP 1 #define VGA_CRTC_H_BLANK_START 2 #define VGA_CRTC_H_BLANK_END 3 #define VGA_CRTC_H_SYNC_START 4 #define VGA_CRTC_H_SYNC_END 5 #define VGA_CRTC_V_TOTAL 6 #define VGA_CRTC_OVERFLOW 7 #define VGA_CRTC_PRESET_ROW 8 #define VGA_CRTC_MAX_SCAN 9 #define VGA_CRTC_CURSOR_START 0x0A #define VGA_CRTC_CURSOR_END 0x0B #define VGA_CRTC_START_HI 0x0C #define VGA_CRTC_START_LO 0x0D #define VGA_CRTC_CURSOR_HI 0x0E #define VGA_CRTC_CURSOR_LO 0x0F #define VGA_CRTC_V_SYNC_START 0x10 #define VGA_CRTC_V_SYNC_END 0x11 #define VGA_CRTC_V_DISP_END 0x12 #define VGA_CRTC_OFFSET 0x13 #define VGA_CRTC_UNDERLINE 0x14 #define VGA_CRTC_V_BLANK_START 0x15 #define VGA_CRTC_V_BLANK_END 0x16 #define VGA_CRTC_MODE 0x17 #define VGA_CRTC_LINE_COMPARE 0x18 #define VGA_CRTC_REGS VGA_CRT_C /* VGA CRT controller bit masks */ #define VGA_CR11_LOCK_CR0_CR7 0x80 /* lock writes to CR0 - CR7 */ #define VGA_CR17_H_V_SIGNALS_ENABLED 0x80 /* VGA attribute controller register indices */ #define VGA_ATC_PALETTE0 0x00 #define VGA_ATC_PALETTE1 0x01 #define VGA_ATC_PALETTE2 0x02 #define VGA_ATC_PALETTE3 0x03 #define VGA_ATC_PALETTE4 0x04 #define VGA_ATC_PALETTE5 0x05 #define VGA_ATC_PALETTE6 0x06 #define VGA_ATC_PALETTE7 0x07 #define VGA_ATC_PALETTE8 0x08 #define VGA_ATC_PALETTE9 0x09 #define VGA_ATC_PALETTEA 0x0A #define VGA_ATC_PALETTEB 0x0B #define VGA_ATC_PALETTEC 0x0C #define VGA_ATC_PALETTED 0x0D #define VGA_ATC_PALETTEE 0x0E #define VGA_ATC_PALETTEF 0x0F #define VGA_ATC_MODE 0x10 #define VGA_ATC_OVERSCAN 0x11 #define VGA_ATC_PLANE_ENABLE 0x12 #define VGA_ATC_PEL 0x13 #define VGA_ATC_COLOR_PAGE 0x14 #define VGA_AR_ENABLE_DISPLAY 0x20 /* VGA sequencer register indices */ #define VGA_SEQ_RESET 0x00 #define VGA_SEQ_CLOCK_MODE 0x01 #define VGA_SEQ_PLANE_WRITE 0x02 #define VGA_SEQ_CHARACTER_MAP 0x03 #define VGA_SEQ_MEMORY_MODE 0x04 /* VGA sequencer register bit masks */ #define VGA_SR01_CHAR_CLK_8DOTS 0x01 /* bit 0: character clocks 8 dots wide are generated */ #define VGA_SR01_SCREEN_OFF 0x20 /* bit 5: Screen is off */ #define VGA_SR02_ALL_PLANES 0x0F /* bits 3-0: enable access to all planes */ #define VGA_SR04_EXT_MEM 0x02 /* bit 1: allows complete mem access to 256K */ #define VGA_SR04_SEQ_MODE 0x04 /* bit 2: directs system to use a sequential addressing mode */ #define VGA_SR04_CHN_4M 0x08 /* bit 3: selects modulo 4 addressing for CPU access to display memory */ /* VGA graphics controller register indices */ #define VGA_GFX_SR_VALUE 0x00 #define VGA_GFX_SR_ENABLE 0x01 #define VGA_GFX_COMPARE_VALUE 0x02 #define VGA_GFX_DATA_ROTATE 0x03 #define VGA_GFX_PLANE_READ 0x04 #define VGA_GFX_MODE 0x05 #define VGA_GFX_MISC 0x06 #define VGA_GFX_COMPARE_MASK 0x07 #define VGA_GFX_BIT_MASK 0x08 /* VGA graphics controller bit masks */ #define VGA_GR06_GRAPHICS_MODE 0x01 /* macro for composing an 8-bit VGA register index and value * into a single 16-bit quantity */ #define VGA_OUT16VAL(v, r) (((v) << 8) | (r)) /* decide whether we should enable the faster 16-bit VGA register writes */ #ifdef __LITTLE_ENDIAN #define VGA_OUTW_WRITE #endif /* VGA State Save and Restore */ #define VGA_SAVE_FONT0 1 /* save/restore plane 2 fonts */ #define VGA_SAVE_FONT1 2 /* save/restore plane 3 fonts */ #define VGA_SAVE_TEXT 4 /* save/restore plane 0/1 fonts */ #define VGA_SAVE_FONTS 7 /* save/restore all fonts */ #define VGA_SAVE_MODE 8 /* save/restore video mode */ #define VGA_SAVE_CMAP 16 /* save/restore color map/DAC */ struct vgastate { void __iomem *vgabase; /* mmio base, if supported */ unsigned long membase; /* VGA window base, 0 for default - 0xA000 */ __u32 memsize; /* VGA window size, 0 for default 64K */ __u32 flags; /* what state[s] to save (see VGA_SAVE_*) */ __u32 depth; /* current fb depth, not important */ __u32 num_attr; /* number of att registers, 0 for default */ __u32 num_crtc; /* number of crt registers, 0 for default */ __u32 num_gfx; /* number of gfx registers, 0 for default */ __u32 num_seq; /* number of seq registers, 0 for default */ void *vidstate; }; extern int save_vga(struct vgastate *state); extern int restore_vga(struct vgastate *state); static inline unsigned char vga_mm_r (void __iomem *regbase, unsigned short port) { return readb (regbase + port); } static inline void vga_mm_w (void __iomem *regbase, unsigned short port, unsigned char val) { writeb (val, regbase + port); } static inline void vga_mm_w_fast (void __iomem *regbase, unsigned short port, unsigned char reg, unsigned char val) { writew (VGA_OUT16VAL (val, reg), regbase + port); } /* * generic VGA port read/write */ #ifdef CONFIG_HAS_IOPORT static inline unsigned char vga_io_r (unsigned short port) { return inb_p(port); } static inline void vga_io_w (unsigned short port, unsigned char val) { outb_p(val, port); } static inline void vga_io_w_fast (unsigned short port, unsigned char reg, unsigned char val) { outw(VGA_OUT16VAL (val, reg), port); } static inline unsigned char vga_r (void __iomem *regbase, unsigned short port) { if (regbase) return vga_mm_r (regbase, port); else return vga_io_r (port); } static inline void vga_w (void __iomem *regbase, unsigned short port, unsigned char val) { if (regbase) vga_mm_w (regbase, port, val); else vga_io_w (port, val); } static inline void vga_w_fast (void __iomem *regbase, unsigned short port, unsigned char reg, unsigned char val) { if (regbase) vga_mm_w_fast (regbase, port, reg, val); else vga_io_w_fast (port, reg, val); } #else /* CONFIG_HAS_IOPORT */ static inline unsigned char vga_r (void __iomem *regbase, unsigned short port) { return vga_mm_r (regbase, port); } static inline void vga_w (void __iomem *regbase, unsigned short port, unsigned char val) { vga_mm_w (regbase, port, val); } static inline void vga_w_fast (void __iomem *regbase, unsigned short port, unsigned char reg, unsigned char val) { vga_mm_w_fast (regbase, port, reg, val); } #endif /* CONFIG_HAS_IOPORT */ /* * VGA CRTC register read/write */ static inline unsigned char vga_rcrt (void __iomem *regbase, unsigned char reg) { vga_w (regbase, VGA_CRT_IC, reg); return vga_r (regbase, VGA_CRT_DC); } static inline void vga_wcrt (void __iomem *regbase, unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_w_fast (regbase, VGA_CRT_IC, reg, val); #else vga_w (regbase, VGA_CRT_IC, reg); vga_w (regbase, VGA_CRT_DC, val); #endif /* VGA_OUTW_WRITE */ } #ifdef CONFIG_HAS_IOPORT static inline unsigned char vga_io_rcrt (unsigned char reg) { vga_io_w (VGA_CRT_IC, reg); return vga_io_r (VGA_CRT_DC); } static inline void vga_io_wcrt (unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_io_w_fast (VGA_CRT_IC, reg, val); #else vga_io_w (VGA_CRT_IC, reg); vga_io_w (VGA_CRT_DC, val); #endif /* VGA_OUTW_WRITE */ } #endif /* CONFIG_HAS_IOPORT */ static inline unsigned char vga_mm_rcrt (void __iomem *regbase, unsigned char reg) { vga_mm_w (regbase, VGA_CRT_IC, reg); return vga_mm_r (regbase, VGA_CRT_DC); } static inline void vga_mm_wcrt (void __iomem *regbase, unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_mm_w_fast (regbase, VGA_CRT_IC, reg, val); #else vga_mm_w (regbase, VGA_CRT_IC, reg); vga_mm_w (regbase, VGA_CRT_DC, val); #endif /* VGA_OUTW_WRITE */ } /* * VGA sequencer register read/write */ static inline unsigned char vga_rseq (void __iomem *regbase, unsigned char reg) { vga_w (regbase, VGA_SEQ_I, reg); return vga_r (regbase, VGA_SEQ_D); } static inline void vga_wseq (void __iomem *regbase, unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_w_fast (regbase, VGA_SEQ_I, reg, val); #else vga_w (regbase, VGA_SEQ_I, reg); vga_w (regbase, VGA_SEQ_D, val); #endif /* VGA_OUTW_WRITE */ } #ifdef CONFIG_HAS_IOPORT static inline unsigned char vga_io_rseq (unsigned char reg) { vga_io_w (VGA_SEQ_I, reg); return vga_io_r (VGA_SEQ_D); } static inline void vga_io_wseq (unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_io_w_fast (VGA_SEQ_I, reg, val); #else vga_io_w (VGA_SEQ_I, reg); vga_io_w (VGA_SEQ_D, val); #endif /* VGA_OUTW_WRITE */ } #endif /* CONFIG_HAS_IOPORT */ static inline unsigned char vga_mm_rseq (void __iomem *regbase, unsigned char reg) { vga_mm_w (regbase, VGA_SEQ_I, reg); return vga_mm_r (regbase, VGA_SEQ_D); } static inline void vga_mm_wseq (void __iomem *regbase, unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_mm_w_fast (regbase, VGA_SEQ_I, reg, val); #else vga_mm_w (regbase, VGA_SEQ_I, reg); vga_mm_w (regbase, VGA_SEQ_D, val); #endif /* VGA_OUTW_WRITE */ } /* * VGA graphics controller register read/write */ static inline unsigned char vga_rgfx (void __iomem *regbase, unsigned char reg) { vga_w (regbase, VGA_GFX_I, reg); return vga_r (regbase, VGA_GFX_D); } static inline void vga_wgfx (void __iomem *regbase, unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_w_fast (regbase, VGA_GFX_I, reg, val); #else vga_w (regbase, VGA_GFX_I, reg); vga_w (regbase, VGA_GFX_D, val); #endif /* VGA_OUTW_WRITE */ } #ifdef CONFIG_HAS_IOPORT static inline unsigned char vga_io_rgfx (unsigned char reg) { vga_io_w (VGA_GFX_I, reg); return vga_io_r (VGA_GFX_D); } static inline void vga_io_wgfx (unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_io_w_fast (VGA_GFX_I, reg, val); #else vga_io_w (VGA_GFX_I, reg); vga_io_w (VGA_GFX_D, val); #endif /* VGA_OUTW_WRITE */ } #endif /* CONFIG_HAS_IOPORT */ static inline unsigned char vga_mm_rgfx (void __iomem *regbase, unsigned char reg) { vga_mm_w (regbase, VGA_GFX_I, reg); return vga_mm_r (regbase, VGA_GFX_D); } static inline void vga_mm_wgfx (void __iomem *regbase, unsigned char reg, unsigned char val) { #ifdef VGA_OUTW_WRITE vga_mm_w_fast (regbase, VGA_GFX_I, reg, val); #else vga_mm_w (regbase, VGA_GFX_I, reg); vga_mm_w (regbase, VGA_GFX_D, val); #endif /* VGA_OUTW_WRITE */ } /* * VGA attribute controller register read/write */ static inline unsigned char vga_rattr (void __iomem *regbase, unsigned char reg) { vga_w (regbase, VGA_ATT_IW, reg); return vga_r (regbase, VGA_ATT_R); } static inline void vga_wattr (void __iomem *regbase, unsigned char reg, unsigned char val) { vga_w (regbase, VGA_ATT_IW, reg); vga_w (regbase, VGA_ATT_W, val); } #ifdef CONFIG_HAS_IOPORT static inline unsigned char vga_io_rattr (unsigned char reg) { vga_io_w (VGA_ATT_IW, reg); return vga_io_r (VGA_ATT_R); } static inline void vga_io_wattr (unsigned char reg, unsigned char val) { vga_io_w (VGA_ATT_IW, reg); vga_io_w (VGA_ATT_W, val); } #endif /* CONFIG_HAS_IOPORT */ static inline unsigned char vga_mm_rattr (void __iomem *regbase, unsigned char reg) { vga_mm_w (regbase, VGA_ATT_IW, reg); return vga_mm_r (regbase, VGA_ATT_R); } static inline void vga_mm_wattr (void __iomem *regbase, unsigned char reg, unsigned char val) { vga_mm_w (regbase, VGA_ATT_IW, reg); vga_mm_w (regbase, VGA_ATT_W, val); } #endif /* __linux_video_vga_h__ */ |
| 29 29 29 4706 4703 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 | // SPDX-License-Identifier: GPL-2.0 /* * Block rq-qos policy for assigning an I/O priority class to requests. * * Using an rq-qos policy for assigning I/O priority class has two advantages * over using the ioprio_set() system call: * * - This policy is cgroup based so it has all the advantages of cgroups. * - While ioprio_set() does not affect page cache writeback I/O, this rq-qos * controller affects page cache writeback I/O for filesystems that support * assiociating a cgroup with writeback I/O. See also * Documentation/admin-guide/cgroup-v2.rst. */ #include <linux/blk-mq.h> #include <linux/blk_types.h> #include <linux/kernel.h> #include <linux/module.h> #include "blk-cgroup.h" #include "blk-ioprio.h" #include "blk-rq-qos.h" /** * enum prio_policy - I/O priority class policy. * @POLICY_NO_CHANGE: (default) do not modify the I/O priority class. * @POLICY_PROMOTE_TO_RT: modify no-IOPRIO_CLASS_RT to IOPRIO_CLASS_RT. * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into * IOPRIO_CLASS_BE. * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE. * @POLICY_NONE_TO_RT: an alias for POLICY_PROMOTE_TO_RT. * * See also <linux/ioprio.h>. */ enum prio_policy { POLICY_NO_CHANGE = 0, POLICY_PROMOTE_TO_RT = 1, POLICY_RESTRICT_TO_BE = 2, POLICY_ALL_TO_IDLE = 3, POLICY_NONE_TO_RT = 4, }; static const char *policy_name[] = { [POLICY_NO_CHANGE] = "no-change", [POLICY_PROMOTE_TO_RT] = "promote-to-rt", [POLICY_RESTRICT_TO_BE] = "restrict-to-be", [POLICY_ALL_TO_IDLE] = "idle", [POLICY_NONE_TO_RT] = "none-to-rt", }; static struct blkcg_policy ioprio_policy; /** * struct ioprio_blkcg - Per cgroup data. * @cpd: blkcg_policy_data structure. * @prio_policy: One of the IOPRIO_CLASS_* values. See also <linux/ioprio.h>. */ struct ioprio_blkcg { struct blkcg_policy_data cpd; enum prio_policy prio_policy; }; static struct ioprio_blkcg *blkcg_to_ioprio_blkcg(struct blkcg *blkcg) { return container_of(blkcg_to_cpd(blkcg, &ioprio_policy), struct ioprio_blkcg, cpd); } static struct ioprio_blkcg * ioprio_blkcg_from_css(struct cgroup_subsys_state *css) { return blkcg_to_ioprio_blkcg(css_to_blkcg(css)); } static int ioprio_show_prio_policy(struct seq_file *sf, void *v) { struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(seq_css(sf)); seq_printf(sf, "%s\n", policy_name[blkcg->prio_policy]); return 0; } static ssize_t ioprio_set_prio_policy(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(of_css(of)); int ret; if (off != 0) return -EIO; /* kernfs_fop_write_iter() terminates 'buf' with '\0'. */ ret = sysfs_match_string(policy_name, buf); if (ret < 0) return ret; blkcg->prio_policy = ret; return nbytes; } static struct blkcg_policy_data *ioprio_alloc_cpd(gfp_t gfp) { struct ioprio_blkcg *blkcg; blkcg = kzalloc(sizeof(*blkcg), gfp); if (!blkcg) return NULL; blkcg->prio_policy = POLICY_NO_CHANGE; return &blkcg->cpd; } static void ioprio_free_cpd(struct blkcg_policy_data *cpd) { struct ioprio_blkcg *blkcg = container_of(cpd, typeof(*blkcg), cpd); kfree(blkcg); } static struct cftype ioprio_files[] = { { .name = "prio.class", .seq_show = ioprio_show_prio_policy, .write = ioprio_set_prio_policy, }, { } /* sentinel */ }; static struct blkcg_policy ioprio_policy = { .dfl_cftypes = ioprio_files, .legacy_cftypes = ioprio_files, .cpd_alloc_fn = ioprio_alloc_cpd, .cpd_free_fn = ioprio_free_cpd, }; void blkcg_set_ioprio(struct bio *bio) { struct ioprio_blkcg *blkcg = blkcg_to_ioprio_blkcg(bio->bi_blkg->blkcg); u16 prio; if (!blkcg || blkcg->prio_policy == POLICY_NO_CHANGE) return; if (blkcg->prio_policy == POLICY_PROMOTE_TO_RT || blkcg->prio_policy == POLICY_NONE_TO_RT) { /* * For RT threads, the default priority level is 4 because * task_nice is 0. By promoting non-RT io-priority to RT-class * and default level 4, those requests that are already * RT-class but need a higher io-priority can use ioprio_set() * to achieve this. */ if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) != IOPRIO_CLASS_RT) bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4); return; } /* * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers * correspond to a lower priority. Hence, the max_t() below selects * the lower priority of bi_ioprio and the cgroup I/O priority class. * If the bio I/O priority equals IOPRIO_CLASS_NONE, the cgroup I/O * priority is assigned to the bio. */ prio = max_t(u16, bio->bi_ioprio, IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0)); if (prio > bio->bi_ioprio) bio->bi_ioprio = prio; } static int __init ioprio_init(void) { return blkcg_policy_register(&ioprio_policy); } static void __exit ioprio_exit(void) { blkcg_policy_unregister(&ioprio_policy); } module_init(ioprio_init); module_exit(ioprio_exit); |
| 158 154 154 4 145 145 145 145 145 145 145 145 145 1 1 1 1 1 1 1 7 9 4 16 2 2 2 1 1 1 1 2 1 3 2 2 4 4 3 3 3 4 7 7 2 2 3 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (c) 2016 Mellanox Technologies. All rights reserved. * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com> */ #include <net/genetlink.h> #include <net/sock.h> #include <trace/events/devlink.h> #include "devl_internal.h" struct devlink_fmsg_item { struct list_head list; int attrtype; u8 nla_type; u16 len; int value[]; }; struct devlink_fmsg { struct list_head item_list; int err; /* first error encountered on some devlink_fmsg_XXX() call */ bool putting_binary; /* This flag forces enclosing of binary data * in an array brackets. It forces using * of designated API: * devlink_fmsg_binary_pair_nest_start() * devlink_fmsg_binary_pair_nest_end() */ }; static struct devlink_fmsg *devlink_fmsg_alloc(void) { struct devlink_fmsg *fmsg; fmsg = kzalloc(sizeof(*fmsg), GFP_KERNEL); if (!fmsg) return NULL; INIT_LIST_HEAD(&fmsg->item_list); return fmsg; } static void devlink_fmsg_free(struct devlink_fmsg *fmsg) { struct devlink_fmsg_item *item, *tmp; list_for_each_entry_safe(item, tmp, &fmsg->item_list, list) { list_del(&item->list); kfree(item); } kfree(fmsg); } struct devlink_health_reporter { struct list_head list; void *priv; const struct devlink_health_reporter_ops *ops; struct devlink *devlink; struct devlink_port *devlink_port; struct devlink_fmsg *dump_fmsg; u64 graceful_period; bool auto_recover; bool auto_dump; u8 health_state; u64 dump_ts; u64 dump_real_ts; u64 error_count; u64 recovery_count; u64 last_recovery_ts; }; void * devlink_health_reporter_priv(struct devlink_health_reporter *reporter) { return reporter->priv; } EXPORT_SYMBOL_GPL(devlink_health_reporter_priv); static struct devlink_health_reporter * __devlink_health_reporter_find_by_name(struct list_head *reporter_list, const char *reporter_name) { struct devlink_health_reporter *reporter; list_for_each_entry(reporter, reporter_list, list) if (!strcmp(reporter->ops->name, reporter_name)) return reporter; return NULL; } static struct devlink_health_reporter * devlink_health_reporter_find_by_name(struct devlink *devlink, const char *reporter_name) { return __devlink_health_reporter_find_by_name(&devlink->reporter_list, reporter_name); } static struct devlink_health_reporter * devlink_port_health_reporter_find_by_name(struct devlink_port *devlink_port, const char *reporter_name) { return __devlink_health_reporter_find_by_name(&devlink_port->reporter_list, reporter_name); } static struct devlink_health_reporter * __devlink_health_reporter_create(struct devlink *devlink, const struct devlink_health_reporter_ops *ops, u64 graceful_period, void *priv) { struct devlink_health_reporter *reporter; if (WARN_ON(graceful_period && !ops->recover)) return ERR_PTR(-EINVAL); reporter = kzalloc(sizeof(*reporter), GFP_KERNEL); if (!reporter) return ERR_PTR(-ENOMEM); reporter->priv = priv; reporter->ops = ops; reporter->devlink = devlink; reporter->graceful_period = graceful_period; reporter->auto_recover = !!ops->recover; reporter->auto_dump = !!ops->dump; return reporter; } /** * devl_port_health_reporter_create() - create devlink health reporter for * specified port instance * * @port: devlink_port to which health reports will relate * @ops: devlink health reporter ops * @graceful_period: min time (in msec) between recovery attempts * @priv: driver priv pointer */ struct devlink_health_reporter * devl_port_health_reporter_create(struct devlink_port *port, const struct devlink_health_reporter_ops *ops, u64 graceful_period, void *priv) { struct devlink_health_reporter *reporter; devl_assert_locked(port->devlink); if (__devlink_health_reporter_find_by_name(&port->reporter_list, ops->name)) return ERR_PTR(-EEXIST); reporter = __devlink_health_reporter_create(port->devlink, ops, graceful_period, priv); if (IS_ERR(reporter)) return reporter; reporter->devlink_port = port; list_add_tail(&reporter->list, &port->reporter_list); return reporter; } EXPORT_SYMBOL_GPL(devl_port_health_reporter_create); struct devlink_health_reporter * devlink_port_health_reporter_create(struct devlink_port *port, const struct devlink_health_reporter_ops *ops, u64 graceful_period, void *priv) { struct devlink_health_reporter *reporter; struct devlink *devlink = port->devlink; devl_lock(devlink); reporter = devl_port_health_reporter_create(port, ops, graceful_period, priv); devl_unlock(devlink); return reporter; } EXPORT_SYMBOL_GPL(devlink_port_health_reporter_create); /** * devl_health_reporter_create - create devlink health reporter * * @devlink: devlink instance which the health reports will relate * @ops: devlink health reporter ops * @graceful_period: min time (in msec) between recovery attempts * @priv: driver priv pointer */ struct devlink_health_reporter * devl_health_reporter_create(struct devlink *devlink, const struct devlink_health_reporter_ops *ops, u64 graceful_period, void *priv) { struct devlink_health_reporter *reporter; devl_assert_locked(devlink); if (devlink_health_reporter_find_by_name(devlink, ops->name)) return ERR_PTR(-EEXIST); reporter = __devlink_health_reporter_create(devlink, ops, graceful_period, priv); if (IS_ERR(reporter)) return reporter; list_add_tail(&reporter->list, &devlink->reporter_list); return reporter; } EXPORT_SYMBOL_GPL(devl_health_reporter_create); struct devlink_health_reporter * devlink_health_reporter_create(struct devlink *devlink, const struct devlink_health_reporter_ops *ops, u64 graceful_period, void *priv) { struct devlink_health_reporter *reporter; devl_lock(devlink); reporter = devl_health_reporter_create(devlink, ops, graceful_period, priv); devl_unlock(devlink); return reporter; } EXPORT_SYMBOL_GPL(devlink_health_reporter_create); static void devlink_health_reporter_free(struct devlink_health_reporter *reporter) { if (reporter->dump_fmsg) devlink_fmsg_free(reporter->dump_fmsg); kfree(reporter); } /** * devl_health_reporter_destroy() - destroy devlink health reporter * * @reporter: devlink health reporter to destroy */ void devl_health_reporter_destroy(struct devlink_health_reporter *reporter) { devl_assert_locked(reporter->devlink); list_del(&reporter->list); devlink_health_reporter_free(reporter); } EXPORT_SYMBOL_GPL(devl_health_reporter_destroy); void devlink_health_reporter_destroy(struct devlink_health_reporter *reporter) { struct devlink *devlink = reporter->devlink; devl_lock(devlink); devl_health_reporter_destroy(reporter); devl_unlock(devlink); } EXPORT_SYMBOL_GPL(devlink_health_reporter_destroy); static int devlink_nl_health_reporter_fill(struct sk_buff *msg, struct devlink_health_reporter *reporter, enum devlink_command cmd, u32 portid, u32 seq, int flags) { struct devlink *devlink = reporter->devlink; struct nlattr *reporter_attr; void *hdr; hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd); if (!hdr) return -EMSGSIZE; if (devlink_nl_put_handle(msg, devlink)) goto genlmsg_cancel; if (reporter->devlink_port) { if (nla_put_u32(msg, DEVLINK_ATTR_PORT_INDEX, reporter->devlink_port->index)) goto genlmsg_cancel; } reporter_attr = nla_nest_start_noflag(msg, DEVLINK_ATTR_HEALTH_REPORTER); if (!reporter_attr) goto genlmsg_cancel; if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME, reporter->ops->name)) goto reporter_nest_cancel; if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE, reporter->health_state)) goto reporter_nest_cancel; if (devlink_nl_put_u64(msg, DEVLINK_ATTR_HEALTH_REPORTER_ERR_COUNT, reporter->error_count)) goto reporter_nest_cancel; if (devlink_nl_put_u64(msg, DEVLINK_ATTR_HEALTH_REPORTER_RECOVER_COUNT, reporter->recovery_count)) goto reporter_nest_cancel; if (reporter->ops->recover && devlink_nl_put_u64(msg, DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD, reporter->graceful_period)) goto reporter_nest_cancel; if (reporter->ops->recover && nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER, reporter->auto_recover)) goto reporter_nest_cancel; if (reporter->dump_fmsg && devlink_nl_put_u64(msg, DEVLINK_ATTR_HEALTH_REPORTER_DUMP_TS, jiffies_to_msecs(reporter->dump_ts))) goto reporter_nest_cancel; if (reporter->dump_fmsg && devlink_nl_put_u64(msg, DEVLINK_ATTR_HEALTH_REPORTER_DUMP_TS_NS, reporter->dump_real_ts)) goto reporter_nest_cancel; if (reporter->ops->dump && nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP, reporter->auto_dump)) goto reporter_nest_cancel; nla_nest_end(msg, reporter_attr); genlmsg_end(msg, hdr); return 0; reporter_nest_cancel: nla_nest_cancel(msg, reporter_attr); genlmsg_cancel: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } static struct devlink_health_reporter * devlink_health_reporter_get_from_attrs(struct devlink *devlink, struct nlattr **attrs) { struct devlink_port *devlink_port; char *reporter_name; if (!attrs[DEVLINK_ATTR_HEALTH_REPORTER_NAME]) return NULL; reporter_name = nla_data(attrs[DEVLINK_ATTR_HEALTH_REPORTER_NAME]); devlink_port = devlink_port_get_from_attrs(devlink, attrs); if (IS_ERR(devlink_port)) return devlink_health_reporter_find_by_name(devlink, reporter_name); else return devlink_port_health_reporter_find_by_name(devlink_port, reporter_name); } static struct devlink_health_reporter * devlink_health_reporter_get_from_info(struct devlink *devlink, struct genl_info *info) { return devlink_health_reporter_get_from_attrs(devlink, info->attrs); } int devlink_nl_health_reporter_get_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_health_reporter *reporter; struct sk_buff *msg; int err; reporter = devlink_health_reporter_get_from_info(devlink, info); if (!reporter) return -EINVAL; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; err = devlink_nl_health_reporter_fill(msg, reporter, DEVLINK_CMD_HEALTH_REPORTER_GET, info->snd_portid, info->snd_seq, 0); if (err) { nlmsg_free(msg); return err; } return genlmsg_reply(msg, info); } static int devlink_nl_health_reporter_get_dump_one(struct sk_buff *msg, struct devlink *devlink, struct netlink_callback *cb, int flags) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); const struct genl_info *info = genl_info_dump(cb); struct devlink_health_reporter *reporter; unsigned long port_index_end = ULONG_MAX; struct nlattr **attrs = info->attrs; unsigned long port_index_start = 0; struct devlink_port *port; unsigned long port_index; int idx = 0; int err; if (attrs && attrs[DEVLINK_ATTR_PORT_INDEX]) { port_index_start = nla_get_u32(attrs[DEVLINK_ATTR_PORT_INDEX]); port_index_end = port_index_start; flags |= NLM_F_DUMP_FILTERED; goto per_port_dump; } list_for_each_entry(reporter, &devlink->reporter_list, list) { if (idx < state->idx) { idx++; continue; } err = devlink_nl_health_reporter_fill(msg, reporter, DEVLINK_CMD_HEALTH_REPORTER_GET, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, flags); if (err) { state->idx = idx; return err; } idx++; } per_port_dump: xa_for_each_range(&devlink->ports, port_index, port, port_index_start, port_index_end) { list_for_each_entry(reporter, &port->reporter_list, list) { if (idx < state->idx) { idx++; continue; } err = devlink_nl_health_reporter_fill(msg, reporter, DEVLINK_CMD_HEALTH_REPORTER_GET, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, flags); if (err) { state->idx = idx; return err; } idx++; } } return 0; } int devlink_nl_health_reporter_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { return devlink_nl_dumpit(skb, cb, devlink_nl_health_reporter_get_dump_one); } int devlink_nl_health_reporter_set_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_health_reporter *reporter; reporter = devlink_health_reporter_get_from_info(devlink, info); if (!reporter) return -EINVAL; if (!reporter->ops->recover && (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD] || info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER])) return -EOPNOTSUPP; if (!reporter->ops->dump && info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP]) return -EOPNOTSUPP; if (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD]) reporter->graceful_period = nla_get_u64(info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD]); if (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER]) reporter->auto_recover = nla_get_u8(info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER]); if (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP]) reporter->auto_dump = nla_get_u8(info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP]); return 0; } static void devlink_recover_notify(struct devlink_health_reporter *reporter, enum devlink_command cmd) { struct devlink *devlink = reporter->devlink; struct devlink_obj_desc desc; struct sk_buff *msg; int err; WARN_ON(cmd != DEVLINK_CMD_HEALTH_REPORTER_RECOVER); ASSERT_DEVLINK_REGISTERED(devlink); if (!devlink_nl_notify_need(devlink)) return; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return; err = devlink_nl_health_reporter_fill(msg, reporter, cmd, 0, 0, 0); if (err) { nlmsg_free(msg); return; } devlink_nl_obj_desc_init(&desc, devlink); if (reporter->devlink_port) devlink_nl_obj_desc_port_set(&desc, reporter->devlink_port); devlink_nl_notify_send_desc(devlink, msg, &desc); } void devlink_health_reporter_recovery_done(struct devlink_health_reporter *reporter) { reporter->recovery_count++; reporter->last_recovery_ts = jiffies; } EXPORT_SYMBOL_GPL(devlink_health_reporter_recovery_done); static int devlink_health_reporter_recover(struct devlink_health_reporter *reporter, void *priv_ctx, struct netlink_ext_ack *extack) { int err; if (reporter->health_state == DEVLINK_HEALTH_REPORTER_STATE_HEALTHY) return 0; if (!reporter->ops->recover) return -EOPNOTSUPP; err = reporter->ops->recover(reporter, priv_ctx, extack); if (err) return err; devlink_health_reporter_recovery_done(reporter); reporter->health_state = DEVLINK_HEALTH_REPORTER_STATE_HEALTHY; devlink_recover_notify(reporter, DEVLINK_CMD_HEALTH_REPORTER_RECOVER); return 0; } static void devlink_health_dump_clear(struct devlink_health_reporter *reporter) { if (!reporter->dump_fmsg) return; devlink_fmsg_free(reporter->dump_fmsg); reporter->dump_fmsg = NULL; } static int devlink_health_do_dump(struct devlink_health_reporter *reporter, void *priv_ctx, struct netlink_ext_ack *extack) { int err; if (!reporter->ops->dump) return 0; if (reporter->dump_fmsg) return 0; reporter->dump_fmsg = devlink_fmsg_alloc(); if (!reporter->dump_fmsg) return -ENOMEM; devlink_fmsg_obj_nest_start(reporter->dump_fmsg); err = reporter->ops->dump(reporter, reporter->dump_fmsg, priv_ctx, extack); if (err) goto dump_err; devlink_fmsg_obj_nest_end(reporter->dump_fmsg); err = reporter->dump_fmsg->err; if (err) goto dump_err; reporter->dump_ts = jiffies; reporter->dump_real_ts = ktime_get_real_ns(); return 0; dump_err: devlink_health_dump_clear(reporter); return err; } int devlink_health_report(struct devlink_health_reporter *reporter, const char *msg, void *priv_ctx) { enum devlink_health_reporter_state prev_health_state; struct devlink *devlink = reporter->devlink; unsigned long recover_ts_threshold; int ret; /* write a log message of the current error */ WARN_ON(!msg); trace_devlink_health_report(devlink, reporter->ops->name, msg); reporter->error_count++; prev_health_state = reporter->health_state; reporter->health_state = DEVLINK_HEALTH_REPORTER_STATE_ERROR; devlink_recover_notify(reporter, DEVLINK_CMD_HEALTH_REPORTER_RECOVER); /* abort if the previous error wasn't recovered */ recover_ts_threshold = reporter->last_recovery_ts + msecs_to_jiffies(reporter->graceful_period); if (reporter->auto_recover && (prev_health_state != DEVLINK_HEALTH_REPORTER_STATE_HEALTHY || (reporter->last_recovery_ts && reporter->recovery_count && time_is_after_jiffies(recover_ts_threshold)))) { trace_devlink_health_recover_aborted(devlink, reporter->ops->name, reporter->health_state, jiffies - reporter->last_recovery_ts); return -ECANCELED; } if (reporter->auto_dump) { devl_lock(devlink); /* store current dump of current error, for later analysis */ devlink_health_do_dump(reporter, priv_ctx, NULL); devl_unlock(devlink); } if (!reporter->auto_recover) return 0; devl_lock(devlink); ret = devlink_health_reporter_recover(reporter, priv_ctx, NULL); devl_unlock(devlink); return ret; } EXPORT_SYMBOL_GPL(devlink_health_report); void devlink_health_reporter_state_update(struct devlink_health_reporter *reporter, enum devlink_health_reporter_state state) { if (WARN_ON(state != DEVLINK_HEALTH_REPORTER_STATE_HEALTHY && state != DEVLINK_HEALTH_REPORTER_STATE_ERROR)) return; if (reporter->health_state == state) return; reporter->health_state = state; trace_devlink_health_reporter_state_update(reporter->devlink, reporter->ops->name, state); devlink_recover_notify(reporter, DEVLINK_CMD_HEALTH_REPORTER_RECOVER); } EXPORT_SYMBOL_GPL(devlink_health_reporter_state_update); int devlink_nl_health_reporter_recover_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_health_reporter *reporter; reporter = devlink_health_reporter_get_from_info(devlink, info); if (!reporter) return -EINVAL; return devlink_health_reporter_recover(reporter, NULL, info->extack); } static void devlink_fmsg_err_if_binary(struct devlink_fmsg *fmsg) { if (!fmsg->err && fmsg->putting_binary) fmsg->err = -EINVAL; } static void devlink_fmsg_nest_common(struct devlink_fmsg *fmsg, int attrtype) { struct devlink_fmsg_item *item; if (fmsg->err) return; item = kzalloc(sizeof(*item), GFP_KERNEL); if (!item) { fmsg->err = -ENOMEM; return; } item->attrtype = attrtype; list_add_tail(&item->list, &fmsg->item_list); } void devlink_fmsg_obj_nest_start(struct devlink_fmsg *fmsg) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_nest_common(fmsg, DEVLINK_ATTR_FMSG_OBJ_NEST_START); } EXPORT_SYMBOL_GPL(devlink_fmsg_obj_nest_start); static void devlink_fmsg_nest_end(struct devlink_fmsg *fmsg) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_nest_common(fmsg, DEVLINK_ATTR_FMSG_NEST_END); } void devlink_fmsg_obj_nest_end(struct devlink_fmsg *fmsg) { devlink_fmsg_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_obj_nest_end); #define DEVLINK_FMSG_MAX_SIZE (GENLMSG_DEFAULT_SIZE - GENL_HDRLEN - NLA_HDRLEN) static void devlink_fmsg_put_name(struct devlink_fmsg *fmsg, const char *name) { struct devlink_fmsg_item *item; devlink_fmsg_err_if_binary(fmsg); if (fmsg->err) return; if (strlen(name) + 1 > DEVLINK_FMSG_MAX_SIZE) { fmsg->err = -EMSGSIZE; return; } item = kzalloc(sizeof(*item) + strlen(name) + 1, GFP_KERNEL); if (!item) { fmsg->err = -ENOMEM; return; } item->nla_type = DEVLINK_VAR_ATTR_TYPE_NUL_STRING; item->len = strlen(name) + 1; item->attrtype = DEVLINK_ATTR_FMSG_OBJ_NAME; memcpy(&item->value, name, item->len); list_add_tail(&item->list, &fmsg->item_list); } void devlink_fmsg_pair_nest_start(struct devlink_fmsg *fmsg, const char *name) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_nest_common(fmsg, DEVLINK_ATTR_FMSG_PAIR_NEST_START); devlink_fmsg_put_name(fmsg, name); } EXPORT_SYMBOL_GPL(devlink_fmsg_pair_nest_start); void devlink_fmsg_pair_nest_end(struct devlink_fmsg *fmsg) { devlink_fmsg_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_pair_nest_end); void devlink_fmsg_arr_pair_nest_start(struct devlink_fmsg *fmsg, const char *name) { devlink_fmsg_pair_nest_start(fmsg, name); devlink_fmsg_nest_common(fmsg, DEVLINK_ATTR_FMSG_ARR_NEST_START); } EXPORT_SYMBOL_GPL(devlink_fmsg_arr_pair_nest_start); void devlink_fmsg_arr_pair_nest_end(struct devlink_fmsg *fmsg) { devlink_fmsg_nest_end(fmsg); devlink_fmsg_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_arr_pair_nest_end); void devlink_fmsg_binary_pair_nest_start(struct devlink_fmsg *fmsg, const char *name) { devlink_fmsg_arr_pair_nest_start(fmsg, name); fmsg->putting_binary = true; } EXPORT_SYMBOL_GPL(devlink_fmsg_binary_pair_nest_start); void devlink_fmsg_binary_pair_nest_end(struct devlink_fmsg *fmsg) { if (fmsg->err) return; if (!fmsg->putting_binary) fmsg->err = -EINVAL; fmsg->putting_binary = false; devlink_fmsg_arr_pair_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_binary_pair_nest_end); static void devlink_fmsg_put_value(struct devlink_fmsg *fmsg, const void *value, u16 value_len, u8 value_nla_type) { struct devlink_fmsg_item *item; if (fmsg->err) return; if (value_len > DEVLINK_FMSG_MAX_SIZE) { fmsg->err = -EMSGSIZE; return; } item = kzalloc(sizeof(*item) + value_len, GFP_KERNEL); if (!item) { fmsg->err = -ENOMEM; return; } item->nla_type = value_nla_type; item->len = value_len; item->attrtype = DEVLINK_ATTR_FMSG_OBJ_VALUE_DATA; memcpy(&item->value, value, item->len); list_add_tail(&item->list, &fmsg->item_list); } static void devlink_fmsg_bool_put(struct devlink_fmsg *fmsg, bool value) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_put_value(fmsg, &value, sizeof(value), DEVLINK_VAR_ATTR_TYPE_FLAG); } static void devlink_fmsg_u8_put(struct devlink_fmsg *fmsg, u8 value) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_put_value(fmsg, &value, sizeof(value), DEVLINK_VAR_ATTR_TYPE_U8); } void devlink_fmsg_u32_put(struct devlink_fmsg *fmsg, u32 value) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_put_value(fmsg, &value, sizeof(value), DEVLINK_VAR_ATTR_TYPE_U32); } EXPORT_SYMBOL_GPL(devlink_fmsg_u32_put); static void devlink_fmsg_u64_put(struct devlink_fmsg *fmsg, u64 value) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_put_value(fmsg, &value, sizeof(value), DEVLINK_VAR_ATTR_TYPE_U64); } void devlink_fmsg_string_put(struct devlink_fmsg *fmsg, const char *value) { devlink_fmsg_err_if_binary(fmsg); devlink_fmsg_put_value(fmsg, value, strlen(value) + 1, DEVLINK_VAR_ATTR_TYPE_NUL_STRING); } EXPORT_SYMBOL_GPL(devlink_fmsg_string_put); void devlink_fmsg_binary_put(struct devlink_fmsg *fmsg, const void *value, u16 value_len) { if (!fmsg->err && !fmsg->putting_binary) fmsg->err = -EINVAL; devlink_fmsg_put_value(fmsg, value, value_len, DEVLINK_VAR_ATTR_TYPE_BINARY); } EXPORT_SYMBOL_GPL(devlink_fmsg_binary_put); void devlink_fmsg_bool_pair_put(struct devlink_fmsg *fmsg, const char *name, bool value) { devlink_fmsg_pair_nest_start(fmsg, name); devlink_fmsg_bool_put(fmsg, value); devlink_fmsg_pair_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_bool_pair_put); void devlink_fmsg_u8_pair_put(struct devlink_fmsg *fmsg, const char *name, u8 value) { devlink_fmsg_pair_nest_start(fmsg, name); devlink_fmsg_u8_put(fmsg, value); devlink_fmsg_pair_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_u8_pair_put); void devlink_fmsg_u32_pair_put(struct devlink_fmsg *fmsg, const char *name, u32 value) { devlink_fmsg_pair_nest_start(fmsg, name); devlink_fmsg_u32_put(fmsg, value); devlink_fmsg_pair_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_u32_pair_put); void devlink_fmsg_u64_pair_put(struct devlink_fmsg *fmsg, const char *name, u64 value) { devlink_fmsg_pair_nest_start(fmsg, name); devlink_fmsg_u64_put(fmsg, value); devlink_fmsg_pair_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_u64_pair_put); void devlink_fmsg_string_pair_put(struct devlink_fmsg *fmsg, const char *name, const char *value) { devlink_fmsg_pair_nest_start(fmsg, name); devlink_fmsg_string_put(fmsg, value); devlink_fmsg_pair_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_string_pair_put); void devlink_fmsg_binary_pair_put(struct devlink_fmsg *fmsg, const char *name, const void *value, u32 value_len) { u32 data_size; u32 offset; devlink_fmsg_binary_pair_nest_start(fmsg, name); for (offset = 0; offset < value_len; offset += data_size) { data_size = value_len - offset; if (data_size > DEVLINK_FMSG_MAX_SIZE) data_size = DEVLINK_FMSG_MAX_SIZE; devlink_fmsg_binary_put(fmsg, value + offset, data_size); } devlink_fmsg_binary_pair_nest_end(fmsg); fmsg->putting_binary = false; } EXPORT_SYMBOL_GPL(devlink_fmsg_binary_pair_put); static int devlink_fmsg_item_fill_data(struct devlink_fmsg_item *msg, struct sk_buff *skb) { int attrtype = DEVLINK_ATTR_FMSG_OBJ_VALUE_DATA; u8 tmp; switch (msg->nla_type) { case DEVLINK_VAR_ATTR_TYPE_FLAG: /* Always provide flag data, regardless of its value */ tmp = *(bool *)msg->value; return nla_put_u8(skb, attrtype, tmp); case DEVLINK_VAR_ATTR_TYPE_U8: return nla_put_u8(skb, attrtype, *(u8 *)msg->value); case DEVLINK_VAR_ATTR_TYPE_U32: return nla_put_u32(skb, attrtype, *(u32 *)msg->value); case DEVLINK_VAR_ATTR_TYPE_U64: return devlink_nl_put_u64(skb, attrtype, *(u64 *)msg->value); case DEVLINK_VAR_ATTR_TYPE_NUL_STRING: return nla_put_string(skb, attrtype, (char *)&msg->value); case DEVLINK_VAR_ATTR_TYPE_BINARY: return nla_put(skb, attrtype, msg->len, (void *)&msg->value); default: return -EINVAL; } } static int devlink_fmsg_prepare_skb(struct devlink_fmsg *fmsg, struct sk_buff *skb, int *start) { struct devlink_fmsg_item *item; struct nlattr *fmsg_nlattr; int err = 0; int i = 0; fmsg_nlattr = nla_nest_start_noflag(skb, DEVLINK_ATTR_FMSG); if (!fmsg_nlattr) return -EMSGSIZE; list_for_each_entry(item, &fmsg->item_list, list) { if (i < *start) { i++; continue; } switch (item->attrtype) { case DEVLINK_ATTR_FMSG_OBJ_NEST_START: case DEVLINK_ATTR_FMSG_PAIR_NEST_START: case DEVLINK_ATTR_FMSG_ARR_NEST_START: case DEVLINK_ATTR_FMSG_NEST_END: err = nla_put_flag(skb, item->attrtype); break; case DEVLINK_ATTR_FMSG_OBJ_VALUE_DATA: err = nla_put_u8(skb, DEVLINK_ATTR_FMSG_OBJ_VALUE_TYPE, item->nla_type); if (err) break; err = devlink_fmsg_item_fill_data(item, skb); break; case DEVLINK_ATTR_FMSG_OBJ_NAME: err = nla_put_string(skb, item->attrtype, (char *)&item->value); break; default: err = -EINVAL; break; } if (!err) *start = ++i; else break; } nla_nest_end(skb, fmsg_nlattr); return err; } static int devlink_fmsg_snd(struct devlink_fmsg *fmsg, struct genl_info *info, enum devlink_command cmd, int flags) { struct nlmsghdr *nlh; struct sk_buff *skb; bool last = false; int index = 0; void *hdr; int err; if (fmsg->err) return fmsg->err; while (!last) { int tmp_index = index; skb = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!skb) return -ENOMEM; hdr = genlmsg_put(skb, info->snd_portid, info->snd_seq, &devlink_nl_family, flags | NLM_F_MULTI, cmd); if (!hdr) { err = -EMSGSIZE; goto nla_put_failure; } err = devlink_fmsg_prepare_skb(fmsg, skb, &index); if (!err) last = true; else if (err != -EMSGSIZE || tmp_index == index) goto nla_put_failure; genlmsg_end(skb, hdr); err = genlmsg_reply(skb, info); if (err) return err; } skb = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!skb) return -ENOMEM; nlh = nlmsg_put(skb, info->snd_portid, info->snd_seq, NLMSG_DONE, 0, flags | NLM_F_MULTI); if (!nlh) { err = -EMSGSIZE; goto nla_put_failure; } return genlmsg_reply(skb, info); nla_put_failure: nlmsg_free(skb); return err; } static int devlink_fmsg_dumpit(struct devlink_fmsg *fmsg, struct sk_buff *skb, struct netlink_callback *cb, enum devlink_command cmd) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); int index = state->idx; int tmp_index = index; void *hdr; int err; if (fmsg->err) return fmsg->err; hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, &devlink_nl_family, NLM_F_ACK | NLM_F_MULTI, cmd); if (!hdr) { err = -EMSGSIZE; goto nla_put_failure; } err = devlink_fmsg_prepare_skb(fmsg, skb, &index); if ((err && err != -EMSGSIZE) || tmp_index == index) goto nla_put_failure; state->idx = index; genlmsg_end(skb, hdr); return skb->len; nla_put_failure: genlmsg_cancel(skb, hdr); return err; } int devlink_nl_health_reporter_diagnose_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_health_reporter *reporter; struct devlink_fmsg *fmsg; int err; reporter = devlink_health_reporter_get_from_info(devlink, info); if (!reporter) return -EINVAL; if (!reporter->ops->diagnose) return -EOPNOTSUPP; fmsg = devlink_fmsg_alloc(); if (!fmsg) return -ENOMEM; devlink_fmsg_obj_nest_start(fmsg); err = reporter->ops->diagnose(reporter, fmsg, info->extack); if (err) goto out; devlink_fmsg_obj_nest_end(fmsg); err = devlink_fmsg_snd(fmsg, info, DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE, 0); out: devlink_fmsg_free(fmsg); return err; } static struct devlink_health_reporter * devlink_health_reporter_get_from_cb_lock(struct netlink_callback *cb) { const struct genl_info *info = genl_info_dump(cb); struct devlink_health_reporter *reporter; struct nlattr **attrs = info->attrs; struct devlink *devlink; devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs, false); if (IS_ERR(devlink)) return NULL; reporter = devlink_health_reporter_get_from_attrs(devlink, attrs); if (!reporter) { devl_unlock(devlink); devlink_put(devlink); } return reporter; } int devlink_nl_health_reporter_dump_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); struct devlink_health_reporter *reporter; struct devlink *devlink; int err; reporter = devlink_health_reporter_get_from_cb_lock(cb); if (!reporter) return -EINVAL; devlink = reporter->devlink; if (!reporter->ops->dump) { devl_unlock(devlink); devlink_put(devlink); return -EOPNOTSUPP; } if (!state->idx) { err = devlink_health_do_dump(reporter, NULL, cb->extack); if (err) goto unlock; state->dump_ts = reporter->dump_ts; } if (!reporter->dump_fmsg || state->dump_ts != reporter->dump_ts) { NL_SET_ERR_MSG(cb->extack, "Dump trampled, please retry"); err = -EAGAIN; goto unlock; } err = devlink_fmsg_dumpit(reporter->dump_fmsg, skb, cb, DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET); unlock: devl_unlock(devlink); devlink_put(devlink); return err; } int devlink_nl_health_reporter_dump_clear_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_health_reporter *reporter; reporter = devlink_health_reporter_get_from_info(devlink, info); if (!reporter) return -EINVAL; if (!reporter->ops->dump) return -EOPNOTSUPP; devlink_health_dump_clear(reporter); return 0; } int devlink_nl_health_reporter_test_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_health_reporter *reporter; reporter = devlink_health_reporter_get_from_info(devlink, info); if (!reporter) return -EINVAL; if (!reporter->ops->test) return -EOPNOTSUPP; return reporter->ops->test(reporter, info->extack); } /** * devlink_fmsg_dump_skb - Dump sk_buffer structure * @fmsg: devlink formatted message pointer * @skb: pointer to skb * * Dump diagnostic information about sk_buff structure, like headroom, length, * tailroom, MAC, etc. */ void devlink_fmsg_dump_skb(struct devlink_fmsg *fmsg, const struct sk_buff *skb) { struct skb_shared_info *sh = skb_shinfo(skb); struct sock *sk = skb->sk; bool has_mac, has_trans; has_mac = skb_mac_header_was_set(skb); has_trans = skb_transport_header_was_set(skb); devlink_fmsg_pair_nest_start(fmsg, "skb"); devlink_fmsg_obj_nest_start(fmsg); devlink_fmsg_put(fmsg, "actual len", skb->len); devlink_fmsg_put(fmsg, "head len", skb_headlen(skb)); devlink_fmsg_put(fmsg, "data len", skb->data_len); devlink_fmsg_put(fmsg, "tail len", skb_tailroom(skb)); devlink_fmsg_put(fmsg, "MAC", has_mac ? skb->mac_header : -1); devlink_fmsg_put(fmsg, "MAC len", has_mac ? skb_mac_header_len(skb) : -1); devlink_fmsg_put(fmsg, "network hdr", skb->network_header); devlink_fmsg_put(fmsg, "network hdr len", has_trans ? skb_network_header_len(skb) : -1); devlink_fmsg_put(fmsg, "transport hdr", has_trans ? skb->transport_header : -1); devlink_fmsg_put(fmsg, "csum", (__force u32)skb->csum); devlink_fmsg_put(fmsg, "csum_ip_summed", (u8)skb->ip_summed); devlink_fmsg_put(fmsg, "csum_complete_sw", !!skb->csum_complete_sw); devlink_fmsg_put(fmsg, "csum_valid", !!skb->csum_valid); devlink_fmsg_put(fmsg, "csum_level", (u8)skb->csum_level); devlink_fmsg_put(fmsg, "sw_hash", !!skb->sw_hash); devlink_fmsg_put(fmsg, "l4_hash", !!skb->l4_hash); devlink_fmsg_put(fmsg, "proto", ntohs(skb->protocol)); devlink_fmsg_put(fmsg, "pkt_type", (u8)skb->pkt_type); devlink_fmsg_put(fmsg, "iif", skb->skb_iif); if (sk) { devlink_fmsg_pair_nest_start(fmsg, "sk"); devlink_fmsg_obj_nest_start(fmsg); devlink_fmsg_put(fmsg, "family", sk->sk_type); devlink_fmsg_put(fmsg, "type", sk->sk_type); devlink_fmsg_put(fmsg, "proto", sk->sk_protocol); devlink_fmsg_obj_nest_end(fmsg); devlink_fmsg_pair_nest_end(fmsg); } devlink_fmsg_obj_nest_end(fmsg); devlink_fmsg_pair_nest_end(fmsg); devlink_fmsg_pair_nest_start(fmsg, "shinfo"); devlink_fmsg_obj_nest_start(fmsg); devlink_fmsg_put(fmsg, "tx_flags", sh->tx_flags); devlink_fmsg_put(fmsg, "nr_frags", sh->nr_frags); devlink_fmsg_put(fmsg, "gso_size", sh->gso_size); devlink_fmsg_put(fmsg, "gso_type", sh->gso_type); devlink_fmsg_put(fmsg, "gso_segs", sh->gso_segs); devlink_fmsg_obj_nest_end(fmsg); devlink_fmsg_pair_nest_end(fmsg); } EXPORT_SYMBOL_GPL(devlink_fmsg_dump_skb); |
| 3 123 123 8 4 1 3 56 10 6 5 102 1 101 10 27 69 64 9 98 98 98 98 9 7 3 4 7 3 1 2 57 2 50 6 5 56 40 6 24 5 4 41 41 39 2 16 12 7 8 84 83 79 3 2 80 84 74 2 73 3 1 3 2 1 2 66 2 50 38 66 40 48 168 1 1 1 2 133 1 1 28 1 9 9 152 1 134 17 53 1 80 200 3 197 99 120 120 2 2 1 84 86 80 79 2 2 1 1 1 1 1 3 3 2 2 4 3 1 1 1 3 51 1 2 1 2 1 1 2 2 2 1 2 1 108 2 1 34 3 3 41 84 57 557 559 351 140 18 18 18 3 15 4 5 5 7 1 7 5 1 1 3 3 40 154 114 63 2 111 1 93 80 213 142 71 117 96 40 173 242 243 101 243 242 51 48 105 176 8 7 8 7 143 143 63 63 40 24 33 5 44 5 10 14 233 231 1 27 233 53 250 251 98 98 98 7 101 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 | // SPDX-License-Identifier: GPL-2.0-or-later /* * History: * Started: Aug 9 by Lawrence Foard (entropy@world.std.com), * to allow user process control of SCSI devices. * Development Sponsored by Killy Corp. NY NY * * Original driver (sg.c): * Copyright (C) 1992 Lawrence Foard * Version 2 and 3 extensions to driver: * Copyright (C) 1998 - 2014 Douglas Gilbert */ static int sg_version_num = 30536; /* 2 digits for each component */ #define SG_VERSION_STR "3.5.36" /* * D. P. Gilbert (dgilbert@interlog.com), notes: * - scsi logging is available via SCSI_LOG_TIMEOUT macros. First * the kernel/module needs to be built with CONFIG_SCSI_LOGGING * (otherwise the macros compile to empty statements). * */ #include <linux/module.h> #include <linux/fs.h> #include <linux/kernel.h> #include <linux/sched.h> #include <linux/string.h> #include <linux/mm.h> #include <linux/errno.h> #include <linux/mtio.h> #include <linux/ioctl.h> #include <linux/major.h> #include <linux/slab.h> #include <linux/fcntl.h> #include <linux/init.h> #include <linux/poll.h> #include <linux/moduleparam.h> #include <linux/cdev.h> #include <linux/idr.h> #include <linux/seq_file.h> #include <linux/blkdev.h> #include <linux/delay.h> #include <linux/blktrace_api.h> #include <linux/mutex.h> #include <linux/atomic.h> #include <linux/ratelimit.h> #include <linux/uio.h> #include <linux/cred.h> /* for sg_check_file_access() */ #include <scsi/scsi.h> #include <scsi/scsi_cmnd.h> #include <scsi/scsi_dbg.h> #include <scsi/scsi_device.h> #include <scsi/scsi_driver.h> #include <scsi/scsi_eh.h> #include <scsi/scsi_host.h> #include <scsi/scsi_ioctl.h> #include <scsi/scsi_tcq.h> #include <scsi/sg.h> #include "scsi_logging.h" #ifdef CONFIG_SCSI_PROC_FS #include <linux/proc_fs.h> static char *sg_version_date = "20140603"; static int sg_proc_init(void); #endif #define SG_ALLOW_DIO_DEF 0 #define SG_MAX_DEVS (1 << MINORBITS) /* SG_MAX_CDB_SIZE should be 260 (spc4r37 section 3.1.30) however the type * of sg_io_hdr::cmd_len can only represent 255. All SCSI commands greater * than 16 bytes are "variable length" whose length is a multiple of 4 */ #define SG_MAX_CDB_SIZE 252 #define SG_DEFAULT_TIMEOUT mult_frac(SG_DEFAULT_TIMEOUT_USER, HZ, USER_HZ) static int sg_big_buff = SG_DEF_RESERVED_SIZE; /* N.B. This variable is readable and writeable via /proc/scsi/sg/def_reserved_size . Each time sg_open() is called a buffer of this size (or less if there is not enough memory) will be reserved for use by this file descriptor. [Deprecated usage: this variable is also readable via /proc/sys/kernel/sg-big-buff if the sg driver is built into the kernel (i.e. it is not a module).] */ static int def_reserved_size = -1; /* picks up init parameter */ static int sg_allow_dio = SG_ALLOW_DIO_DEF; static int scatter_elem_sz = SG_SCATTER_SZ; static int scatter_elem_sz_prev = SG_SCATTER_SZ; #define SG_SECTOR_SZ 512 static int sg_add_device(struct device *); static void sg_remove_device(struct device *); static DEFINE_IDR(sg_index_idr); static DEFINE_RWLOCK(sg_index_lock); /* Also used to lock file descriptor list for device */ static struct class_interface sg_interface = { .add_dev = sg_add_device, .remove_dev = sg_remove_device, }; typedef struct sg_scatter_hold { /* holding area for scsi scatter gather info */ unsigned short k_use_sg; /* Count of kernel scatter-gather pieces */ unsigned sglist_len; /* size of malloc'd scatter-gather list ++ */ unsigned bufflen; /* Size of (aggregate) data buffer */ struct page **pages; int page_order; char dio_in_use; /* 0->indirect IO (or mmap), 1->dio */ unsigned char cmd_opcode; /* first byte of command */ } Sg_scatter_hold; struct sg_device; /* forward declarations */ struct sg_fd; typedef struct sg_request { /* SG_MAX_QUEUE requests outstanding per file */ struct list_head entry; /* list entry */ struct sg_fd *parentfp; /* NULL -> not in use */ Sg_scatter_hold data; /* hold buffer, perhaps scatter list */ sg_io_hdr_t header; /* scsi command+info, see <scsi/sg.h> */ unsigned char sense_b[SCSI_SENSE_BUFFERSIZE]; char res_used; /* 1 -> using reserve buffer, 0 -> not ... */ char orphan; /* 1 -> drop on sight, 0 -> normal */ char sg_io_owned; /* 1 -> packet belongs to SG_IO */ /* done protected by rq_list_lock */ char done; /* 0->before bh, 1->before read, 2->read */ struct request *rq; struct bio *bio; struct execute_work ew; } Sg_request; typedef struct sg_fd { /* holds the state of a file descriptor */ struct list_head sfd_siblings; /* protected by device's sfd_lock */ struct sg_device *parentdp; /* owning device */ wait_queue_head_t read_wait; /* queue read until command done */ rwlock_t rq_list_lock; /* protect access to list in req_arr */ struct mutex f_mutex; /* protect against changes in this fd */ int timeout; /* defaults to SG_DEFAULT_TIMEOUT */ int timeout_user; /* defaults to SG_DEFAULT_TIMEOUT_USER */ Sg_scatter_hold reserve; /* buffer held for this file descriptor */ struct list_head rq_list; /* head of request list */ struct fasync_struct *async_qp; /* used by asynchronous notification */ Sg_request req_arr[SG_MAX_QUEUE]; /* used as singly-linked list */ char force_packid; /* 1 -> pack_id input to read(), 0 -> ignored */ char cmd_q; /* 1 -> allow command queuing, 0 -> don't */ unsigned char next_cmd_len; /* 0: automatic, >0: use on next write() */ char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ char mmap_called; /* 0 -> mmap() never called on this fd */ char res_in_use; /* 1 -> 'reserve' array in use */ struct kref f_ref; struct execute_work ew; } Sg_fd; typedef struct sg_device { /* holds the state of each scsi generic device */ struct scsi_device *device; wait_queue_head_t open_wait; /* queue open() when O_EXCL present */ struct mutex open_rel_lock; /* held when in open() or release() */ int sg_tablesize; /* adapter's max scatter-gather table size */ u32 index; /* device index number */ struct list_head sfds; rwlock_t sfd_lock; /* protect access to sfd list */ atomic_t detaching; /* 0->device usable, 1->device detaching */ bool exclude; /* 1->open(O_EXCL) succeeded and is active */ int open_cnt; /* count of opens (perhaps < num(sfds) ) */ char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ char name[DISK_NAME_LEN]; struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ struct kref d_ref; } Sg_device; /* tasklet or soft irq callback */ static enum rq_end_io_ret sg_rq_end_io(struct request *rq, blk_status_t status); static int sg_start_req(Sg_request *srp, unsigned char *cmd); static int sg_finish_rem_req(Sg_request * srp); static int sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size); static ssize_t sg_new_read(Sg_fd * sfp, char __user *buf, size_t count, Sg_request * srp); static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, size_t count, int blocking, int read_only, int sg_io_owned, Sg_request **o_srp); static int sg_common_write(Sg_fd * sfp, Sg_request * srp, unsigned char *cmnd, int timeout, int blocking); static int sg_read_oxfer(Sg_request * srp, char __user *outp, int num_read_xfer); static void sg_remove_scat(Sg_fd * sfp, Sg_scatter_hold * schp); static void sg_build_reserve(Sg_fd * sfp, int req_size); static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); static Sg_fd *sg_add_sfp(Sg_device * sdp); static void sg_remove_sfp(struct kref *); static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id, bool *busy); static Sg_request *sg_add_request(Sg_fd * sfp); static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); static Sg_device *sg_get_dev(int dev); static void sg_device_destroy(struct kref *kref); #define SZ_SG_HEADER sizeof(struct sg_header) #define SZ_SG_IO_HDR sizeof(sg_io_hdr_t) #define SZ_SG_IOVEC sizeof(sg_iovec_t) #define SZ_SG_REQ_INFO sizeof(sg_req_info_t) #define sg_printk(prefix, sdp, fmt, a...) \ sdev_prefix_printk(prefix, (sdp)->device, (sdp)->name, fmt, ##a) /* * The SCSI interfaces that use read() and write() as an asynchronous variant of * ioctl(..., SG_IO, ...) are fundamentally unsafe, since there are lots of ways * to trigger read() and write() calls from various contexts with elevated * privileges. This can lead to kernel memory corruption (e.g. if these * interfaces are called through splice()) and privilege escalation inside * userspace (e.g. if a process with access to such a device passes a file * descriptor to a SUID binary as stdin/stdout/stderr). * * This function provides protection for the legacy API by restricting the * calling context. */ static int sg_check_file_access(struct file *filp, const char *caller) { if (filp->f_cred != current_real_cred()) { pr_err_once("%s: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n", caller, task_tgid_vnr(current), current->comm); return -EPERM; } return 0; } static int sg_allow_access(struct file *filp, unsigned char *cmd) { struct sg_fd *sfp = filp->private_data; if (sfp->parentdp->device->type == TYPE_SCANNER) return 0; if (!scsi_cmd_allowed(cmd, filp->f_mode & FMODE_WRITE)) return -EPERM; return 0; } static int open_wait(Sg_device *sdp, int flags) { int retval = 0; if (flags & O_EXCL) { while (sdp->open_cnt > 0) { mutex_unlock(&sdp->open_rel_lock); retval = wait_event_interruptible(sdp->open_wait, (atomic_read(&sdp->detaching) || !sdp->open_cnt)); mutex_lock(&sdp->open_rel_lock); if (retval) /* -ERESTARTSYS */ return retval; if (atomic_read(&sdp->detaching)) return -ENODEV; } } else { while (sdp->exclude) { mutex_unlock(&sdp->open_rel_lock); retval = wait_event_interruptible(sdp->open_wait, (atomic_read(&sdp->detaching) || !sdp->exclude)); mutex_lock(&sdp->open_rel_lock); if (retval) /* -ERESTARTSYS */ return retval; if (atomic_read(&sdp->detaching)) return -ENODEV; } } return retval; } /* Returns 0 on success, else a negated errno value */ static int sg_open(struct inode *inode, struct file *filp) { int dev = iminor(inode); int flags = filp->f_flags; struct request_queue *q; struct scsi_device *device; Sg_device *sdp; Sg_fd *sfp; int retval; nonseekable_open(inode, filp); if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) return -EPERM; /* Can't lock it with read only access */ sdp = sg_get_dev(dev); if (IS_ERR(sdp)) return PTR_ERR(sdp); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_open: flags=0x%x\n", flags)); /* This driver's module count bumped by fops_get in <linux/fs.h> */ /* Prevent the device driver from vanishing while we sleep */ device = sdp->device; retval = scsi_device_get(device); if (retval) goto sg_put; /* scsi_block_when_processing_errors() may block so bypass * check if O_NONBLOCK. Permits SCSI commands to be issued * during error recovery. Tread carefully. */ if (!((flags & O_NONBLOCK) || scsi_block_when_processing_errors(device))) { retval = -ENXIO; /* we are in error recovery for this device */ goto sdp_put; } mutex_lock(&sdp->open_rel_lock); if (flags & O_NONBLOCK) { if (flags & O_EXCL) { if (sdp->open_cnt > 0) { retval = -EBUSY; goto error_mutex_locked; } } else { if (sdp->exclude) { retval = -EBUSY; goto error_mutex_locked; } } } else { retval = open_wait(sdp, flags); if (retval) /* -ERESTARTSYS or -ENODEV */ goto error_mutex_locked; } /* N.B. at this point we are holding the open_rel_lock */ if (flags & O_EXCL) sdp->exclude = true; if (sdp->open_cnt < 1) { /* no existing opens */ sdp->sgdebug = 0; q = device->request_queue; sdp->sg_tablesize = queue_max_segments(q); } sfp = sg_add_sfp(sdp); if (IS_ERR(sfp)) { retval = PTR_ERR(sfp); goto out_undo; } filp->private_data = sfp; sdp->open_cnt++; mutex_unlock(&sdp->open_rel_lock); retval = 0; sg_put: kref_put(&sdp->d_ref, sg_device_destroy); return retval; out_undo: if (flags & O_EXCL) { sdp->exclude = false; /* undo if error */ wake_up_interruptible(&sdp->open_wait); } error_mutex_locked: mutex_unlock(&sdp->open_rel_lock); sdp_put: kref_put(&sdp->d_ref, sg_device_destroy); scsi_device_put(device); return retval; } /* Release resources associated with a successful sg_open() * Returns 0 on success, else a negated errno value */ static int sg_release(struct inode *inode, struct file *filp) { Sg_device *sdp; Sg_fd *sfp; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_release\n")); mutex_lock(&sdp->open_rel_lock); sdp->open_cnt--; /* possibly many open()s waiting on exlude clearing, start many; * only open(O_EXCL)s wait on 0==open_cnt so only start one */ if (sdp->exclude) { sdp->exclude = false; wake_up_interruptible_all(&sdp->open_wait); } else if (0 == sdp->open_cnt) { wake_up_interruptible(&sdp->open_wait); } mutex_unlock(&sdp->open_rel_lock); kref_put(&sfp->f_ref, sg_remove_sfp); return 0; } static int get_sg_io_pack_id(int *pack_id, void __user *buf, size_t count) { struct sg_header __user *old_hdr = buf; int reply_len; if (count >= SZ_SG_HEADER) { /* negative reply_len means v3 format, otherwise v1/v2 */ if (get_user(reply_len, &old_hdr->reply_len)) return -EFAULT; if (reply_len >= 0) return get_user(*pack_id, &old_hdr->pack_id); if (in_compat_syscall() && count >= sizeof(struct compat_sg_io_hdr)) { struct compat_sg_io_hdr __user *hp = buf; return get_user(*pack_id, &hp->pack_id); } if (count >= sizeof(struct sg_io_hdr)) { struct sg_io_hdr __user *hp = buf; return get_user(*pack_id, &hp->pack_id); } } /* no valid header was passed, so ignore the pack_id */ *pack_id = -1; return 0; } static ssize_t sg_read(struct file *filp, char __user *buf, size_t count, loff_t * ppos) { Sg_device *sdp; Sg_fd *sfp; Sg_request *srp; int req_pack_id = -1; bool busy; sg_io_hdr_t *hp; struct sg_header *old_hdr; int retval; /* * This could cause a response to be stranded. Close the associated * file descriptor to free up any resources being held. */ retval = sg_check_file_access(filp, __func__); if (retval) return retval; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_read: count=%d\n", (int) count)); if (sfp->force_packid) retval = get_sg_io_pack_id(&req_pack_id, buf, count); if (retval) return retval; srp = sg_get_rq_mark(sfp, req_pack_id, &busy); if (!srp) { /* now wait on packet to arrive */ if (filp->f_flags & O_NONBLOCK) return -EAGAIN; retval = wait_event_interruptible(sfp->read_wait, ((srp = sg_get_rq_mark(sfp, req_pack_id, &busy)) || (!busy && atomic_read(&sdp->detaching)))); if (!srp) /* signal or detaching */ return retval ? retval : -ENODEV; } if (srp->header.interface_id != '\0') return sg_new_read(sfp, buf, count, srp); hp = &srp->header; old_hdr = kzalloc(SZ_SG_HEADER, GFP_KERNEL); if (!old_hdr) return -ENOMEM; old_hdr->reply_len = (int) hp->timeout; old_hdr->pack_len = old_hdr->reply_len; /* old, strange behaviour */ old_hdr->pack_id = hp->pack_id; old_hdr->twelve_byte = ((srp->data.cmd_opcode >= 0xc0) && (12 == hp->cmd_len)) ? 1 : 0; old_hdr->target_status = hp->masked_status; old_hdr->host_status = hp->host_status; old_hdr->driver_status = hp->driver_status; if ((CHECK_CONDITION & hp->masked_status) || (srp->sense_b[0] & 0x70) == 0x70) { old_hdr->driver_status = DRIVER_SENSE; memcpy(old_hdr->sense_buffer, srp->sense_b, sizeof (old_hdr->sense_buffer)); } switch (hp->host_status) { /* This setup of 'result' is for backward compatibility and is best ignored by the user who should use target, host + driver status */ case DID_OK: case DID_PASSTHROUGH: case DID_SOFT_ERROR: old_hdr->result = 0; break; case DID_NO_CONNECT: case DID_BUS_BUSY: case DID_TIME_OUT: old_hdr->result = EBUSY; break; case DID_BAD_TARGET: case DID_ABORT: case DID_PARITY: case DID_RESET: case DID_BAD_INTR: old_hdr->result = EIO; break; case DID_ERROR: old_hdr->result = (srp->sense_b[0] == 0 && hp->masked_status == GOOD) ? 0 : EIO; break; default: old_hdr->result = EIO; break; } /* Now copy the result back to the user buffer. */ if (count >= SZ_SG_HEADER) { if (copy_to_user(buf, old_hdr, SZ_SG_HEADER)) { retval = -EFAULT; goto free_old_hdr; } buf += SZ_SG_HEADER; if (count > old_hdr->reply_len) count = old_hdr->reply_len; if (count > SZ_SG_HEADER) { if (sg_read_oxfer(srp, buf, count - SZ_SG_HEADER)) { retval = -EFAULT; goto free_old_hdr; } } } else count = (old_hdr->result == 0) ? 0 : -EIO; sg_finish_rem_req(srp); sg_remove_request(sfp, srp); retval = count; free_old_hdr: kfree(old_hdr); return retval; } static ssize_t sg_new_read(Sg_fd * sfp, char __user *buf, size_t count, Sg_request * srp) { sg_io_hdr_t *hp = &srp->header; int err = 0, err2; int len; if (in_compat_syscall()) { if (count < sizeof(struct compat_sg_io_hdr)) { err = -EINVAL; goto err_out; } } else if (count < SZ_SG_IO_HDR) { err = -EINVAL; goto err_out; } hp->sb_len_wr = 0; if ((hp->mx_sb_len > 0) && hp->sbp) { if ((CHECK_CONDITION & hp->masked_status) || (srp->sense_b[0] & 0x70) == 0x70) { int sb_len = SCSI_SENSE_BUFFERSIZE; sb_len = (hp->mx_sb_len > sb_len) ? sb_len : hp->mx_sb_len; len = 8 + (int) srp->sense_b[7]; /* Additional sense length field */ len = (len > sb_len) ? sb_len : len; if (copy_to_user(hp->sbp, srp->sense_b, len)) { err = -EFAULT; goto err_out; } hp->driver_status = DRIVER_SENSE; hp->sb_len_wr = len; } } if (hp->masked_status || hp->host_status || hp->driver_status) hp->info |= SG_INFO_CHECK; err = put_sg_io_hdr(hp, buf); err_out: err2 = sg_finish_rem_req(srp); sg_remove_request(sfp, srp); return err ? : err2 ? : count; } static ssize_t sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos) { int mxsize, cmd_size, k; int input_size, blocking; unsigned char opcode; Sg_device *sdp; Sg_fd *sfp; Sg_request *srp; struct sg_header old_hdr; sg_io_hdr_t *hp; unsigned char cmnd[SG_MAX_CDB_SIZE]; int retval; retval = sg_check_file_access(filp, __func__); if (retval) return retval; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_write: count=%d\n", (int) count)); if (atomic_read(&sdp->detaching)) return -ENODEV; if (!((filp->f_flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) return -ENXIO; if (count < SZ_SG_HEADER) return -EIO; if (copy_from_user(&old_hdr, buf, SZ_SG_HEADER)) return -EFAULT; blocking = !(filp->f_flags & O_NONBLOCK); if (old_hdr.reply_len < 0) return sg_new_write(sfp, filp, buf, count, blocking, 0, 0, NULL); if (count < (SZ_SG_HEADER + 6)) return -EIO; /* The minimum scsi command length is 6 bytes. */ buf += SZ_SG_HEADER; if (get_user(opcode, buf)) return -EFAULT; if (!(srp = sg_add_request(sfp))) { SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sdp, "sg_write: queue full\n")); return -EDOM; } mutex_lock(&sfp->f_mutex); if (sfp->next_cmd_len > 0) { cmd_size = sfp->next_cmd_len; sfp->next_cmd_len = 0; /* reset so only this write() effected */ } else { cmd_size = COMMAND_SIZE(opcode); /* based on SCSI command group */ if ((opcode >= 0xc0) && old_hdr.twelve_byte) cmd_size = 12; } mutex_unlock(&sfp->f_mutex); SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sdp, "sg_write: scsi opcode=0x%02x, cmd_size=%d\n", (int) opcode, cmd_size)); /* Determine buffer size. */ input_size = count - cmd_size; mxsize = (input_size > old_hdr.reply_len) ? input_size : old_hdr.reply_len; mxsize -= SZ_SG_HEADER; input_size -= SZ_SG_HEADER; if (input_size < 0) { sg_remove_request(sfp, srp); return -EIO; /* User did not pass enough bytes for this command. */ } hp = &srp->header; hp->interface_id = '\0'; /* indicator of old interface tunnelled */ hp->cmd_len = (unsigned char) cmd_size; hp->iovec_count = 0; hp->mx_sb_len = 0; if (input_size > 0) hp->dxfer_direction = (old_hdr.reply_len > SZ_SG_HEADER) ? SG_DXFER_TO_FROM_DEV : SG_DXFER_TO_DEV; else hp->dxfer_direction = (mxsize > 0) ? SG_DXFER_FROM_DEV : SG_DXFER_NONE; hp->dxfer_len = mxsize; if ((hp->dxfer_direction == SG_DXFER_TO_DEV) || (hp->dxfer_direction == SG_DXFER_TO_FROM_DEV)) hp->dxferp = (char __user *)buf + cmd_size; else hp->dxferp = NULL; hp->sbp = NULL; hp->timeout = old_hdr.reply_len; /* structure abuse ... */ hp->flags = input_size; /* structure abuse ... */ hp->pack_id = old_hdr.pack_id; hp->usr_ptr = NULL; if (copy_from_user(cmnd, buf, cmd_size)) { sg_remove_request(sfp, srp); return -EFAULT; } /* * SG_DXFER_TO_FROM_DEV is functionally equivalent to SG_DXFER_FROM_DEV, * but is is possible that the app intended SG_DXFER_TO_DEV, because there * is a non-zero input_size, so emit a warning. */ if (hp->dxfer_direction == SG_DXFER_TO_FROM_DEV) { printk_ratelimited(KERN_WARNING "sg_write: data in/out %d/%d bytes " "for SCSI command 0x%x-- guessing " "data in;\n program %s not setting " "count and/or reply_len properly\n", old_hdr.reply_len - (int)SZ_SG_HEADER, input_size, (unsigned int) cmnd[0], current->comm); } k = sg_common_write(sfp, srp, cmnd, sfp->timeout, blocking); return (k < 0) ? k : count; } static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, size_t count, int blocking, int read_only, int sg_io_owned, Sg_request **o_srp) { int k; Sg_request *srp; sg_io_hdr_t *hp; unsigned char cmnd[SG_MAX_CDB_SIZE]; int timeout; unsigned long ul_timeout; if (count < SZ_SG_IO_HDR) return -EINVAL; sfp->cmd_q = 1; /* when sg_io_hdr seen, set command queuing on */ if (!(srp = sg_add_request(sfp))) { SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sfp->parentdp, "sg_new_write: queue full\n")); return -EDOM; } srp->sg_io_owned = sg_io_owned; hp = &srp->header; if (get_sg_io_hdr(hp, buf)) { sg_remove_request(sfp, srp); return -EFAULT; } if (hp->interface_id != 'S') { sg_remove_request(sfp, srp); return -ENOSYS; } if (hp->flags & SG_FLAG_MMAP_IO) { if (hp->dxfer_len > sfp->reserve.bufflen) { sg_remove_request(sfp, srp); return -ENOMEM; /* MMAP_IO size must fit in reserve buffer */ } if (hp->flags & SG_FLAG_DIRECT_IO) { sg_remove_request(sfp, srp); return -EINVAL; /* either MMAP_IO or DIRECT_IO (not both) */ } if (sfp->res_in_use) { sg_remove_request(sfp, srp); return -EBUSY; /* reserve buffer already being used */ } } ul_timeout = msecs_to_jiffies(srp->header.timeout); timeout = (ul_timeout < INT_MAX) ? ul_timeout : INT_MAX; if ((!hp->cmdp) || (hp->cmd_len < 6) || (hp->cmd_len > sizeof (cmnd))) { sg_remove_request(sfp, srp); return -EMSGSIZE; } if (copy_from_user(cmnd, hp->cmdp, hp->cmd_len)) { sg_remove_request(sfp, srp); return -EFAULT; } if (read_only && sg_allow_access(file, cmnd)) { sg_remove_request(sfp, srp); return -EPERM; } k = sg_common_write(sfp, srp, cmnd, timeout, blocking); if (k < 0) return k; if (o_srp) *o_srp = srp; return count; } static int sg_common_write(Sg_fd * sfp, Sg_request * srp, unsigned char *cmnd, int timeout, int blocking) { int k, at_head; Sg_device *sdp = sfp->parentdp; sg_io_hdr_t *hp = &srp->header; srp->data.cmd_opcode = cmnd[0]; /* hold opcode of command */ hp->status = 0; hp->masked_status = 0; hp->msg_status = 0; hp->info = 0; hp->host_status = 0; hp->driver_status = 0; hp->resid = 0; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_common_write: scsi opcode=0x%02x, cmd_size=%d\n", (int) cmnd[0], (int) hp->cmd_len)); if (hp->dxfer_len >= SZ_256M) { sg_remove_request(sfp, srp); return -EINVAL; } k = sg_start_req(srp, cmnd); if (k) { SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sfp->parentdp, "sg_common_write: start_req err=%d\n", k)); sg_finish_rem_req(srp); sg_remove_request(sfp, srp); return k; /* probably out of space --> ENOMEM */ } if (atomic_read(&sdp->detaching)) { if (srp->bio) { blk_mq_free_request(srp->rq); srp->rq = NULL; } sg_finish_rem_req(srp); sg_remove_request(sfp, srp); return -ENODEV; } hp->duration = jiffies_to_msecs(jiffies); if (hp->interface_id != '\0' && /* v3 (or later) interface */ (SG_FLAG_Q_AT_TAIL & hp->flags)) at_head = 0; else at_head = 1; srp->rq->timeout = timeout; kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */ srp->rq->end_io = sg_rq_end_io; blk_execute_rq_nowait(srp->rq, at_head); return 0; } static int srp_done(Sg_fd *sfp, Sg_request *srp) { unsigned long flags; int ret; read_lock_irqsave(&sfp->rq_list_lock, flags); ret = srp->done; read_unlock_irqrestore(&sfp->rq_list_lock, flags); return ret; } static int max_sectors_bytes(struct request_queue *q) { unsigned int max_sectors = queue_max_sectors(q); max_sectors = min_t(unsigned int, max_sectors, INT_MAX >> 9); return max_sectors << 9; } static void sg_fill_request_table(Sg_fd *sfp, sg_req_info_t *rinfo) { Sg_request *srp; int val; unsigned int ms; val = 0; list_for_each_entry(srp, &sfp->rq_list, entry) { if (val >= SG_MAX_QUEUE) break; rinfo[val].req_state = srp->done + 1; rinfo[val].problem = srp->header.masked_status & srp->header.host_status & srp->header.driver_status; if (srp->done) rinfo[val].duration = srp->header.duration; else { ms = jiffies_to_msecs(jiffies); rinfo[val].duration = (ms > srp->header.duration) ? (ms - srp->header.duration) : 0; } rinfo[val].orphan = srp->orphan; rinfo[val].sg_io_owned = srp->sg_io_owned; rinfo[val].pack_id = srp->header.pack_id; rinfo[val].usr_ptr = srp->header.usr_ptr; val++; } } #ifdef CONFIG_COMPAT struct compat_sg_req_info { /* used by SG_GET_REQUEST_TABLE ioctl() */ char req_state; char orphan; char sg_io_owned; char problem; int pack_id; compat_uptr_t usr_ptr; unsigned int duration; int unused; }; static int put_compat_request_table(struct compat_sg_req_info __user *o, struct sg_req_info *rinfo) { int i; for (i = 0; i < SG_MAX_QUEUE; i++) { if (copy_to_user(o + i, rinfo + i, offsetof(sg_req_info_t, usr_ptr)) || put_user((uintptr_t)rinfo[i].usr_ptr, &o[i].usr_ptr) || put_user(rinfo[i].duration, &o[i].duration) || put_user(rinfo[i].unused, &o[i].unused)) return -EFAULT; } return 0; } #endif static long sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp, unsigned int cmd_in, void __user *p) { int __user *ip = p; int result, val, read_only; Sg_request *srp; unsigned long iflags; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_ioctl: cmd=0x%x\n", (int) cmd_in)); read_only = (O_RDWR != (filp->f_flags & O_ACCMODE)); switch (cmd_in) { case SG_IO: if (atomic_read(&sdp->detaching)) return -ENODEV; if (!scsi_block_when_processing_errors(sdp->device)) return -ENXIO; result = sg_new_write(sfp, filp, p, SZ_SG_IO_HDR, 1, read_only, 1, &srp); if (result < 0) return result; result = wait_event_interruptible(sfp->read_wait, srp_done(sfp, srp)); write_lock_irq(&sfp->rq_list_lock); if (srp->done) { srp->done = 2; write_unlock_irq(&sfp->rq_list_lock); result = sg_new_read(sfp, p, SZ_SG_IO_HDR, srp); return (result < 0) ? result : 0; } srp->orphan = 1; write_unlock_irq(&sfp->rq_list_lock); return result; /* -ERESTARTSYS because signal hit process */ case SG_SET_TIMEOUT: result = get_user(val, ip); if (result) return result; if (val < 0) return -EIO; if (val >= mult_frac((s64)INT_MAX, USER_HZ, HZ)) val = min_t(s64, mult_frac((s64)INT_MAX, USER_HZ, HZ), INT_MAX); sfp->timeout_user = val; sfp->timeout = mult_frac(val, HZ, USER_HZ); return 0; case SG_GET_TIMEOUT: /* N.B. User receives timeout as return value */ /* strange ..., for backward compatibility */ return sfp->timeout_user; case SG_SET_FORCE_LOW_DMA: /* * N.B. This ioctl never worked properly, but failed to * return an error value. So returning '0' to keep compability * with legacy applications. */ return 0; case SG_GET_LOW_DMA: return put_user(0, ip); case SG_GET_SCSI_ID: { sg_scsi_id_t v; if (atomic_read(&sdp->detaching)) return -ENODEV; memset(&v, 0, sizeof(v)); v.host_no = sdp->device->host->host_no; v.channel = sdp->device->channel; v.scsi_id = sdp->device->id; v.lun = sdp->device->lun; v.scsi_type = sdp->device->type; v.h_cmd_per_lun = sdp->device->host->cmd_per_lun; v.d_queue_depth = sdp->device->queue_depth; if (copy_to_user(p, &v, sizeof(sg_scsi_id_t))) return -EFAULT; return 0; } case SG_SET_FORCE_PACK_ID: result = get_user(val, ip); if (result) return result; sfp->force_packid = val ? 1 : 0; return 0; case SG_GET_PACK_ID: read_lock_irqsave(&sfp->rq_list_lock, iflags); list_for_each_entry(srp, &sfp->rq_list, entry) { if ((1 == srp->done) && (!srp->sg_io_owned)) { read_unlock_irqrestore(&sfp->rq_list_lock, iflags); return put_user(srp->header.pack_id, ip); } } read_unlock_irqrestore(&sfp->rq_list_lock, iflags); return put_user(-1, ip); case SG_GET_NUM_WAITING: read_lock_irqsave(&sfp->rq_list_lock, iflags); val = 0; list_for_each_entry(srp, &sfp->rq_list, entry) { if ((1 == srp->done) && (!srp->sg_io_owned)) ++val; } read_unlock_irqrestore(&sfp->rq_list_lock, iflags); return put_user(val, ip); case SG_GET_SG_TABLESIZE: return put_user(sdp->sg_tablesize, ip); case SG_SET_RESERVED_SIZE: result = get_user(val, ip); if (result) return result; if (val < 0) return -EINVAL; val = min_t(int, val, max_sectors_bytes(sdp->device->request_queue)); mutex_lock(&sfp->f_mutex); if (val != sfp->reserve.bufflen) { if (sfp->mmap_called || sfp->res_in_use) { mutex_unlock(&sfp->f_mutex); return -EBUSY; } sg_remove_scat(sfp, &sfp->reserve); sg_build_reserve(sfp, val); } mutex_unlock(&sfp->f_mutex); return 0; case SG_GET_RESERVED_SIZE: val = min_t(int, sfp->reserve.bufflen, max_sectors_bytes(sdp->device->request_queue)); return put_user(val, ip); case SG_SET_COMMAND_Q: result = get_user(val, ip); if (result) return result; sfp->cmd_q = val ? 1 : 0; return 0; case SG_GET_COMMAND_Q: return put_user((int) sfp->cmd_q, ip); case SG_SET_KEEP_ORPHAN: result = get_user(val, ip); if (result) return result; sfp->keep_orphan = val; return 0; case SG_GET_KEEP_ORPHAN: return put_user((int) sfp->keep_orphan, ip); case SG_NEXT_CMD_LEN: result = get_user(val, ip); if (result) return result; if (val > SG_MAX_CDB_SIZE) return -ENOMEM; sfp->next_cmd_len = (val > 0) ? val : 0; return 0; case SG_GET_VERSION_NUM: return put_user(sg_version_num, ip); case SG_GET_ACCESS_COUNT: /* faked - we don't have a real access count anymore */ val = (sdp->device ? 1 : 0); return put_user(val, ip); case SG_GET_REQUEST_TABLE: { sg_req_info_t *rinfo; rinfo = kcalloc(SG_MAX_QUEUE, SZ_SG_REQ_INFO, GFP_KERNEL); if (!rinfo) return -ENOMEM; read_lock_irqsave(&sfp->rq_list_lock, iflags); sg_fill_request_table(sfp, rinfo); read_unlock_irqrestore(&sfp->rq_list_lock, iflags); #ifdef CONFIG_COMPAT if (in_compat_syscall()) result = put_compat_request_table(p, rinfo); else #endif result = copy_to_user(p, rinfo, SZ_SG_REQ_INFO * SG_MAX_QUEUE); result = result ? -EFAULT : 0; kfree(rinfo); return result; } case SG_EMULATED_HOST: if (atomic_read(&sdp->detaching)) return -ENODEV; return put_user(sdp->device->host->hostt->emulated, ip); case SCSI_IOCTL_SEND_COMMAND: if (atomic_read(&sdp->detaching)) return -ENODEV; return scsi_ioctl(sdp->device, filp->f_mode & FMODE_WRITE, cmd_in, p); case SG_SET_DEBUG: result = get_user(val, ip); if (result) return result; sdp->sgdebug = (char) val; return 0; case BLKSECTGET: return put_user(max_sectors_bytes(sdp->device->request_queue), ip); case BLKTRACESETUP: return blk_trace_setup(sdp->device->request_queue, sdp->name, MKDEV(SCSI_GENERIC_MAJOR, sdp->index), NULL, p); case BLKTRACESTART: return blk_trace_startstop(sdp->device->request_queue, 1); case BLKTRACESTOP: return blk_trace_startstop(sdp->device->request_queue, 0); case BLKTRACETEARDOWN: return blk_trace_remove(sdp->device->request_queue); case SCSI_IOCTL_GET_IDLUN: case SCSI_IOCTL_GET_BUS_NUMBER: case SCSI_IOCTL_PROBE_HOST: case SG_GET_TRANSFORM: case SG_SCSI_RESET: if (atomic_read(&sdp->detaching)) return -ENODEV; break; default: if (read_only) return -EPERM; /* don't know so take safe approach */ break; } result = scsi_ioctl_block_when_processing_errors(sdp->device, cmd_in, filp->f_flags & O_NDELAY); if (result) return result; return -ENOIOCTLCMD; } static long sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) { void __user *p = (void __user *)arg; Sg_device *sdp; Sg_fd *sfp; int ret; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; ret = sg_ioctl_common(filp, sdp, sfp, cmd_in, p); if (ret != -ENOIOCTLCMD) return ret; return scsi_ioctl(sdp->device, filp->f_mode & FMODE_WRITE, cmd_in, p); } static __poll_t sg_poll(struct file *filp, poll_table * wait) { __poll_t res = 0; Sg_device *sdp; Sg_fd *sfp; Sg_request *srp; int count = 0; unsigned long iflags; sfp = filp->private_data; if (!sfp) return EPOLLERR; sdp = sfp->parentdp; if (!sdp) return EPOLLERR; poll_wait(filp, &sfp->read_wait, wait); read_lock_irqsave(&sfp->rq_list_lock, iflags); list_for_each_entry(srp, &sfp->rq_list, entry) { /* if any read waiting, flag it */ if ((0 == res) && (1 == srp->done) && (!srp->sg_io_owned)) res = EPOLLIN | EPOLLRDNORM; ++count; } read_unlock_irqrestore(&sfp->rq_list_lock, iflags); if (atomic_read(&sdp->detaching)) res |= EPOLLHUP; else if (!sfp->cmd_q) { if (0 == count) res |= EPOLLOUT | EPOLLWRNORM; } else if (count < SG_MAX_QUEUE) res |= EPOLLOUT | EPOLLWRNORM; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_poll: res=0x%x\n", (__force u32) res)); return res; } static int sg_fasync(int fd, struct file *filp, int mode) { Sg_device *sdp; Sg_fd *sfp; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_fasync: mode=%d\n", mode)); return fasync_helper(fd, filp, mode, &sfp->async_qp); } static vm_fault_t sg_vma_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; Sg_fd *sfp; unsigned long offset, len, sa; Sg_scatter_hold *rsv_schp; int k, length; if ((NULL == vma) || (!(sfp = (Sg_fd *) vma->vm_private_data))) return VM_FAULT_SIGBUS; rsv_schp = &sfp->reserve; offset = vmf->pgoff << PAGE_SHIFT; if (offset >= rsv_schp->bufflen) return VM_FAULT_SIGBUS; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sfp->parentdp, "sg_vma_fault: offset=%lu, scatg=%d\n", offset, rsv_schp->k_use_sg)); sa = vma->vm_start; length = 1 << (PAGE_SHIFT + rsv_schp->page_order); for (k = 0; k < rsv_schp->k_use_sg && sa < vma->vm_end; k++) { len = vma->vm_end - sa; len = (len < length) ? len : length; if (offset < len) { struct page *page = nth_page(rsv_schp->pages[k], offset >> PAGE_SHIFT); get_page(page); /* increment page count */ vmf->page = page; return 0; /* success */ } sa += len; offset -= len; } return VM_FAULT_SIGBUS; } static const struct vm_operations_struct sg_mmap_vm_ops = { .fault = sg_vma_fault, }; static int sg_mmap(struct file *filp, struct vm_area_struct *vma) { Sg_fd *sfp; unsigned long req_sz, len, sa; Sg_scatter_hold *rsv_schp; int k, length; int ret = 0; if ((!filp) || (!vma) || (!(sfp = (Sg_fd *) filp->private_data))) return -ENXIO; req_sz = vma->vm_end - vma->vm_start; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sfp->parentdp, "sg_mmap starting, vm_start=%p, len=%d\n", (void *) vma->vm_start, (int) req_sz)); if (vma->vm_pgoff) return -EINVAL; /* want no offset */ rsv_schp = &sfp->reserve; mutex_lock(&sfp->f_mutex); if (req_sz > rsv_schp->bufflen) { ret = -ENOMEM; /* cannot map more than reserved buffer */ goto out; } sa = vma->vm_start; length = 1 << (PAGE_SHIFT + rsv_schp->page_order); for (k = 0; k < rsv_schp->k_use_sg && sa < vma->vm_end; k++) { len = vma->vm_end - sa; len = (len < length) ? len : length; sa += len; } sfp->mmap_called = 1; vm_flags_set(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP); vma->vm_private_data = sfp; vma->vm_ops = &sg_mmap_vm_ops; out: mutex_unlock(&sfp->f_mutex); return ret; } static void sg_rq_end_io_usercontext(struct work_struct *work) { struct sg_request *srp = container_of(work, struct sg_request, ew.work); struct sg_fd *sfp = srp->parentfp; sg_finish_rem_req(srp); sg_remove_request(sfp, srp); kref_put(&sfp->f_ref, sg_remove_sfp); } /* * This function is a "bottom half" handler that is called by the mid * level when a command is completed (or has failed). */ static enum rq_end_io_ret sg_rq_end_io(struct request *rq, blk_status_t status) { struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq); struct sg_request *srp = rq->end_io_data; Sg_device *sdp; Sg_fd *sfp; unsigned long iflags; unsigned int ms; char *sense; int result, resid, done = 1; if (WARN_ON(srp->done != 0)) return RQ_END_IO_NONE; sfp = srp->parentfp; if (WARN_ON(sfp == NULL)) return RQ_END_IO_NONE; sdp = sfp->parentdp; if (unlikely(atomic_read(&sdp->detaching))) pr_info("%s: device detaching\n", __func__); sense = scmd->sense_buffer; result = scmd->result; resid = scmd->resid_len; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sdp, "sg_cmd_done: pack_id=%d, res=0x%x\n", srp->header.pack_id, result)); srp->header.resid = resid; ms = jiffies_to_msecs(jiffies); srp->header.duration = (ms > srp->header.duration) ? (ms - srp->header.duration) : 0; if (0 != result) { struct scsi_sense_hdr sshdr; srp->header.status = 0xff & result; srp->header.masked_status = sg_status_byte(result); srp->header.msg_status = COMMAND_COMPLETE; srp->header.host_status = host_byte(result); srp->header.driver_status = driver_byte(result); if ((sdp->sgdebug > 0) && ((CHECK_CONDITION == srp->header.masked_status) || (COMMAND_TERMINATED == srp->header.masked_status))) __scsi_print_sense(sdp->device, __func__, sense, SCSI_SENSE_BUFFERSIZE); /* Following if statement is a patch supplied by Eric Youngdale */ if (driver_byte(result) != 0 && scsi_normalize_sense(sense, SCSI_SENSE_BUFFERSIZE, &sshdr) && !scsi_sense_is_deferred(&sshdr) && sshdr.sense_key == UNIT_ATTENTION && sdp->device->removable) { /* Detected possible disc change. Set the bit - this */ /* may be used if there are filesystems using this device */ sdp->device->changed = 1; } } if (scmd->sense_len) memcpy(srp->sense_b, scmd->sense_buffer, SCSI_SENSE_BUFFERSIZE); /* Rely on write phase to clean out srp status values, so no "else" */ /* * Free the request as soon as it is complete so that its resources * can be reused without waiting for userspace to read() the * result. But keep the associated bio (if any) around until * blk_rq_unmap_user() can be called from user context. */ srp->rq = NULL; blk_mq_free_request(rq); write_lock_irqsave(&sfp->rq_list_lock, iflags); if (unlikely(srp->orphan)) { if (sfp->keep_orphan) srp->sg_io_owned = 0; else done = 0; } srp->done = done; write_unlock_irqrestore(&sfp->rq_list_lock, iflags); if (likely(done)) { /* Now wake up any sg_read() that is waiting for this * packet. */ wake_up_interruptible(&sfp->read_wait); kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); kref_put(&sfp->f_ref, sg_remove_sfp); } else { INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext); schedule_work(&srp->ew.work); } return RQ_END_IO_NONE; } static const struct file_operations sg_fops = { .owner = THIS_MODULE, .read = sg_read, .write = sg_write, .poll = sg_poll, .unlocked_ioctl = sg_ioctl, .compat_ioctl = compat_ptr_ioctl, .open = sg_open, .mmap = sg_mmap, .release = sg_release, .fasync = sg_fasync, }; static const struct class sg_sysfs_class = { .name = "scsi_generic" }; static int sg_sysfs_valid = 0; static Sg_device * sg_alloc(struct scsi_device *scsidp) { struct request_queue *q = scsidp->request_queue; Sg_device *sdp; unsigned long iflags; int error; u32 k; sdp = kzalloc(sizeof(Sg_device), GFP_KERNEL); if (!sdp) { sdev_printk(KERN_WARNING, scsidp, "%s: kmalloc Sg_device " "failure\n", __func__); return ERR_PTR(-ENOMEM); } idr_preload(GFP_KERNEL); write_lock_irqsave(&sg_index_lock, iflags); error = idr_alloc(&sg_index_idr, sdp, 0, SG_MAX_DEVS, GFP_NOWAIT); if (error < 0) { if (error == -ENOSPC) { sdev_printk(KERN_WARNING, scsidp, "Unable to attach sg device type=%d, minor number exceeds %d\n", scsidp->type, SG_MAX_DEVS - 1); error = -ENODEV; } else { sdev_printk(KERN_WARNING, scsidp, "%s: idr " "allocation Sg_device failure: %d\n", __func__, error); } goto out_unlock; } k = error; SCSI_LOG_TIMEOUT(3, sdev_printk(KERN_INFO, scsidp, "sg_alloc: dev=%d \n", k)); sprintf(sdp->name, "sg%d", k); sdp->device = scsidp; mutex_init(&sdp->open_rel_lock); INIT_LIST_HEAD(&sdp->sfds); init_waitqueue_head(&sdp->open_wait); atomic_set(&sdp->detaching, 0); rwlock_init(&sdp->sfd_lock); sdp->sg_tablesize = queue_max_segments(q); sdp->index = k; kref_init(&sdp->d_ref); error = 0; out_unlock: write_unlock_irqrestore(&sg_index_lock, iflags); idr_preload_end(); if (error) { kfree(sdp); return ERR_PTR(error); } return sdp; } static int sg_add_device(struct device *cl_dev) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = NULL; struct cdev * cdev = NULL; int error; unsigned long iflags; if (!blk_get_queue(scsidp->request_queue)) { pr_warn("%s: get scsi_device queue failed\n", __func__); return -ENODEV; } error = -ENOMEM; cdev = cdev_alloc(); if (!cdev) { pr_warn("%s: cdev_alloc failed\n", __func__); goto out; } cdev->owner = THIS_MODULE; cdev->ops = &sg_fops; sdp = sg_alloc(scsidp); if (IS_ERR(sdp)) { pr_warn("%s: sg_alloc failed\n", __func__); error = PTR_ERR(sdp); goto out; } error = cdev_add(cdev, MKDEV(SCSI_GENERIC_MAJOR, sdp->index), 1); if (error) goto cdev_add_err; sdp->cdev = cdev; if (sg_sysfs_valid) { struct device *sg_class_member; sg_class_member = device_create(&sg_sysfs_class, cl_dev->parent, MKDEV(SCSI_GENERIC_MAJOR, sdp->index), sdp, "%s", sdp->name); if (IS_ERR(sg_class_member)) { pr_err("%s: device_create failed\n", __func__); error = PTR_ERR(sg_class_member); goto cdev_add_err; } error = sysfs_create_link(&scsidp->sdev_gendev.kobj, &sg_class_member->kobj, "generic"); if (error) pr_err("%s: unable to make symlink 'generic' back " "to sg%d\n", __func__, sdp->index); } else pr_warn("%s: sg_sys Invalid\n", __func__); sdev_printk(KERN_NOTICE, scsidp, "Attached scsi generic sg%d " "type %d\n", sdp->index, scsidp->type); dev_set_drvdata(cl_dev, sdp); return 0; cdev_add_err: write_lock_irqsave(&sg_index_lock, iflags); idr_remove(&sg_index_idr, sdp->index); write_unlock_irqrestore(&sg_index_lock, iflags); kfree(sdp); out: if (cdev) cdev_del(cdev); blk_put_queue(scsidp->request_queue); return error; } static void sg_device_destroy(struct kref *kref) { struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); struct request_queue *q = sdp->device->request_queue; unsigned long flags; /* CAUTION! Note that the device can still be found via idr_find() * even though the refcount is 0. Therefore, do idr_remove() BEFORE * any other cleanup. */ blk_trace_remove(q); blk_put_queue(q); write_lock_irqsave(&sg_index_lock, flags); idr_remove(&sg_index_idr, sdp->index); write_unlock_irqrestore(&sg_index_lock, flags); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_device_destroy\n")); kfree(sdp); } static void sg_remove_device(struct device *cl_dev) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = dev_get_drvdata(cl_dev); unsigned long iflags; Sg_fd *sfp; int val; if (!sdp) return; /* want sdp->detaching non-zero as soon as possible */ val = atomic_inc_return(&sdp->detaching); if (val > 1) return; /* only want to do following once per device */ SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "%s\n", __func__)); read_lock_irqsave(&sdp->sfd_lock, iflags); list_for_each_entry(sfp, &sdp->sfds, sfd_siblings) { wake_up_interruptible_all(&sfp->read_wait); kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); } wake_up_interruptible_all(&sdp->open_wait); read_unlock_irqrestore(&sdp->sfd_lock, iflags); sysfs_remove_link(&scsidp->sdev_gendev.kobj, "generic"); device_destroy(&sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index)); cdev_del(sdp->cdev); sdp->cdev = NULL; kref_put(&sdp->d_ref, sg_device_destroy); } module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); module_param_named(def_reserved_size, def_reserved_size, int, S_IRUGO | S_IWUSR); module_param_named(allow_dio, sg_allow_dio, int, S_IRUGO | S_IWUSR); MODULE_AUTHOR("Douglas Gilbert"); MODULE_DESCRIPTION("SCSI generic (sg) driver"); MODULE_LICENSE("GPL"); MODULE_VERSION(SG_VERSION_STR); MODULE_ALIAS_CHARDEV_MAJOR(SCSI_GENERIC_MAJOR); MODULE_PARM_DESC(scatter_elem_sz, "scatter gather element " "size (default: max(SG_SCATTER_SZ, PAGE_SIZE))"); MODULE_PARM_DESC(def_reserved_size, "size of buffer reserved for each fd"); MODULE_PARM_DESC(allow_dio, "allow direct I/O (default: 0 (disallow))"); #ifdef CONFIG_SYSCTL #include <linux/sysctl.h> static const struct ctl_table sg_sysctls[] = { { .procname = "sg-big-buff", .data = &sg_big_buff, .maxlen = sizeof(int), .mode = 0444, .proc_handler = proc_dointvec, }, }; static struct ctl_table_header *hdr; static void register_sg_sysctls(void) { if (!hdr) hdr = register_sysctl("kernel", sg_sysctls); } static void unregister_sg_sysctls(void) { unregister_sysctl_table(hdr); } #else #define register_sg_sysctls() do { } while (0) #define unregister_sg_sysctls() do { } while (0) #endif /* CONFIG_SYSCTL */ static int __init init_sg(void) { int rc; if (scatter_elem_sz < PAGE_SIZE) { scatter_elem_sz = PAGE_SIZE; scatter_elem_sz_prev = scatter_elem_sz; } if (def_reserved_size >= 0) sg_big_buff = def_reserved_size; else def_reserved_size = sg_big_buff; rc = register_chrdev_region(MKDEV(SCSI_GENERIC_MAJOR, 0), SG_MAX_DEVS, "sg"); if (rc) return rc; rc = class_register(&sg_sysfs_class); if (rc) goto err_out; sg_sysfs_valid = 1; rc = scsi_register_interface(&sg_interface); if (0 == rc) { #ifdef CONFIG_SCSI_PROC_FS sg_proc_init(); #endif /* CONFIG_SCSI_PROC_FS */ return 0; } class_unregister(&sg_sysfs_class); register_sg_sysctls(); err_out: unregister_chrdev_region(MKDEV(SCSI_GENERIC_MAJOR, 0), SG_MAX_DEVS); return rc; } static void __exit exit_sg(void) { unregister_sg_sysctls(); #ifdef CONFIG_SCSI_PROC_FS remove_proc_subtree("scsi/sg", NULL); #endif /* CONFIG_SCSI_PROC_FS */ scsi_unregister_interface(&sg_interface); class_unregister(&sg_sysfs_class); sg_sysfs_valid = 0; unregister_chrdev_region(MKDEV(SCSI_GENERIC_MAJOR, 0), SG_MAX_DEVS); idr_destroy(&sg_index_idr); } static int sg_start_req(Sg_request *srp, unsigned char *cmd) { int res; struct request *rq; Sg_fd *sfp = srp->parentfp; sg_io_hdr_t *hp = &srp->header; int dxfer_len = (int) hp->dxfer_len; int dxfer_dir = hp->dxfer_direction; unsigned int iov_count = hp->iovec_count; Sg_scatter_hold *req_schp = &srp->data; Sg_scatter_hold *rsv_schp = &sfp->reserve; struct request_queue *q = sfp->parentdp->device->request_queue; struct rq_map_data *md, map_data; int rw = hp->dxfer_direction == SG_DXFER_TO_DEV ? ITER_SOURCE : ITER_DEST; struct scsi_cmnd *scmd; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_start_req: dxfer_len=%d\n", dxfer_len)); /* * NOTE * * With scsi-mq enabled, there are a fixed number of preallocated * requests equal in number to shost->can_queue. If all of the * preallocated requests are already in use, then scsi_alloc_request() * will sleep until an active command completes, freeing up a request. * Although waiting in an asynchronous interface is less than ideal, we * do not want to use BLK_MQ_REQ_NOWAIT here because userspace might * not expect an EWOULDBLOCK from this condition. */ rq = scsi_alloc_request(q, hp->dxfer_direction == SG_DXFER_TO_DEV ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0); if (IS_ERR(rq)) return PTR_ERR(rq); scmd = blk_mq_rq_to_pdu(rq); if (hp->cmd_len > sizeof(scmd->cmnd)) { blk_mq_free_request(rq); return -EINVAL; } memcpy(scmd->cmnd, cmd, hp->cmd_len); scmd->cmd_len = hp->cmd_len; srp->rq = rq; rq->end_io_data = srp; scmd->allowed = SG_DEFAULT_RETRIES; if ((dxfer_len <= 0) || (dxfer_dir == SG_DXFER_NONE)) return 0; if (sg_allow_dio && hp->flags & SG_FLAG_DIRECT_IO && dxfer_dir != SG_DXFER_UNKNOWN && !iov_count && blk_rq_aligned(q, (unsigned long)hp->dxferp, dxfer_len)) md = NULL; else md = &map_data; if (md) { mutex_lock(&sfp->f_mutex); if (dxfer_len <= rsv_schp->bufflen && !sfp->res_in_use) { sfp->res_in_use = 1; sg_link_reserve(sfp, srp, dxfer_len); } else if (hp->flags & SG_FLAG_MMAP_IO) { res = -EBUSY; /* sfp->res_in_use == 1 */ if (dxfer_len > rsv_schp->bufflen) res = -ENOMEM; mutex_unlock(&sfp->f_mutex); return res; } else { res = sg_build_indirect(req_schp, sfp, dxfer_len); if (res) { mutex_unlock(&sfp->f_mutex); return res; } } mutex_unlock(&sfp->f_mutex); md->pages = req_schp->pages; md->page_order = req_schp->page_order; md->nr_entries = req_schp->k_use_sg; md->offset = 0; md->null_mapped = hp->dxferp ? 0 : 1; if (dxfer_dir == SG_DXFER_TO_FROM_DEV) md->from_user = 1; else md->from_user = 0; } res = blk_rq_map_user_io(rq, md, hp->dxferp, hp->dxfer_len, GFP_ATOMIC, iov_count, iov_count, 1, rw); if (!res) { srp->bio = rq->bio; if (!md) { req_schp->dio_in_use = 1; hp->info |= SG_INFO_DIRECT_IO; } } return res; } static int sg_finish_rem_req(Sg_request *srp) { int ret = 0; Sg_fd *sfp = srp->parentfp; Sg_scatter_hold *req_schp = &srp->data; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_finish_rem_req: res_used=%d\n", (int) srp->res_used)); if (srp->bio) ret = blk_rq_unmap_user(srp->bio); if (srp->rq) blk_mq_free_request(srp->rq); if (srp->res_used) sg_unlink_reserve(sfp, srp); else sg_remove_scat(sfp, req_schp); return ret; } static int sg_build_sgat(Sg_scatter_hold * schp, const Sg_fd * sfp, int tablesize) { int sg_bufflen = tablesize * sizeof(struct page *); gfp_t gfp_flags = GFP_ATOMIC | __GFP_NOWARN; schp->pages = kzalloc(sg_bufflen, gfp_flags); if (!schp->pages) return -ENOMEM; schp->sglist_len = sg_bufflen; return tablesize; /* number of scat_gath elements allocated */ } static int sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size) { int ret_sz = 0, i, k, rem_sz, num, mx_sc_elems; int sg_tablesize = sfp->parentdp->sg_tablesize; int blk_size = buff_size, order; gfp_t gfp_mask = GFP_ATOMIC | __GFP_COMP | __GFP_NOWARN | __GFP_ZERO; if (blk_size < 0) return -EFAULT; if (0 == blk_size) ++blk_size; /* don't know why */ /* round request up to next highest SG_SECTOR_SZ byte boundary */ blk_size = ALIGN(blk_size, SG_SECTOR_SZ); SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_indirect: buff_size=%d, blk_size=%d\n", buff_size, blk_size)); /* N.B. ret_sz carried into this block ... */ mx_sc_elems = sg_build_sgat(schp, sfp, sg_tablesize); if (mx_sc_elems < 0) return mx_sc_elems; /* most likely -ENOMEM */ num = scatter_elem_sz; if (unlikely(num != scatter_elem_sz_prev)) { if (num < PAGE_SIZE) { scatter_elem_sz = PAGE_SIZE; scatter_elem_sz_prev = PAGE_SIZE; } else scatter_elem_sz_prev = num; } order = get_order(num); retry: ret_sz = 1 << (PAGE_SHIFT + order); for (k = 0, rem_sz = blk_size; rem_sz > 0 && k < mx_sc_elems; k++, rem_sz -= ret_sz) { num = (rem_sz > scatter_elem_sz_prev) ? scatter_elem_sz_prev : rem_sz; schp->pages[k] = alloc_pages(gfp_mask, order); if (!schp->pages[k]) goto out; if (num == scatter_elem_sz_prev) { if (unlikely(ret_sz > scatter_elem_sz_prev)) { scatter_elem_sz = ret_sz; scatter_elem_sz_prev = ret_sz; } } SCSI_LOG_TIMEOUT(5, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_indirect: k=%d, num=%d, ret_sz=%d\n", k, num, ret_sz)); } /* end of for loop */ schp->page_order = order; schp->k_use_sg = k; SCSI_LOG_TIMEOUT(5, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_indirect: k_use_sg=%d, rem_sz=%d\n", k, rem_sz)); schp->bufflen = blk_size; if (rem_sz > 0) /* must have failed */ return -ENOMEM; return 0; out: for (i = 0; i < k; i++) __free_pages(schp->pages[i], order); if (--order >= 0) goto retry; return -ENOMEM; } static void sg_remove_scat(Sg_fd * sfp, Sg_scatter_hold * schp) { SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_remove_scat: k_use_sg=%d\n", schp->k_use_sg)); if (schp->pages && schp->sglist_len > 0) { if (!schp->dio_in_use) { int k; for (k = 0; k < schp->k_use_sg && schp->pages[k]; k++) { SCSI_LOG_TIMEOUT(5, sg_printk(KERN_INFO, sfp->parentdp, "sg_remove_scat: k=%d, pg=0x%p\n", k, schp->pages[k])); __free_pages(schp->pages[k], schp->page_order); } kfree(schp->pages); } } memset(schp, 0, sizeof (*schp)); } static int sg_read_oxfer(Sg_request * srp, char __user *outp, int num_read_xfer) { Sg_scatter_hold *schp = &srp->data; int k, num; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, srp->parentfp->parentdp, "sg_read_oxfer: num_read_xfer=%d\n", num_read_xfer)); if ((!outp) || (num_read_xfer <= 0)) return 0; num = 1 << (PAGE_SHIFT + schp->page_order); for (k = 0; k < schp->k_use_sg && schp->pages[k]; k++) { if (num > num_read_xfer) { if (copy_to_user(outp, page_address(schp->pages[k]), num_read_xfer)) return -EFAULT; break; } else { if (copy_to_user(outp, page_address(schp->pages[k]), num)) return -EFAULT; num_read_xfer -= num; if (num_read_xfer <= 0) break; outp += num; } } return 0; } static void sg_build_reserve(Sg_fd * sfp, int req_size) { Sg_scatter_hold *schp = &sfp->reserve; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_reserve: req_size=%d\n", req_size)); do { if (req_size < PAGE_SIZE) req_size = PAGE_SIZE; if (0 == sg_build_indirect(schp, sfp, req_size)) return; else sg_remove_scat(sfp, schp); req_size >>= 1; /* divide by 2 */ } while (req_size > (PAGE_SIZE / 2)); } static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size) { Sg_scatter_hold *req_schp = &srp->data; Sg_scatter_hold *rsv_schp = &sfp->reserve; int k, num, rem; srp->res_used = 1; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_link_reserve: size=%d\n", size)); rem = size; num = 1 << (PAGE_SHIFT + rsv_schp->page_order); for (k = 0; k < rsv_schp->k_use_sg; k++) { if (rem <= num) { req_schp->k_use_sg = k + 1; req_schp->sglist_len = rsv_schp->sglist_len; req_schp->pages = rsv_schp->pages; req_schp->bufflen = size; req_schp->page_order = rsv_schp->page_order; break; } else rem -= num; } if (k >= rsv_schp->k_use_sg) SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sfp->parentdp, "sg_link_reserve: BAD size\n")); } static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp) { Sg_scatter_hold *req_schp = &srp->data; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, srp->parentfp->parentdp, "sg_unlink_reserve: req->k_use_sg=%d\n", (int) req_schp->k_use_sg)); req_schp->k_use_sg = 0; req_schp->bufflen = 0; req_schp->pages = NULL; req_schp->page_order = 0; req_schp->sglist_len = 0; srp->res_used = 0; /* Called without mutex lock to avoid deadlock */ sfp->res_in_use = 0; } static Sg_request * sg_get_rq_mark(Sg_fd * sfp, int pack_id, bool *busy) { Sg_request *resp; unsigned long iflags; *busy = false; write_lock_irqsave(&sfp->rq_list_lock, iflags); list_for_each_entry(resp, &sfp->rq_list, entry) { /* look for requests that are not SG_IO owned */ if ((!resp->sg_io_owned) && ((-1 == pack_id) || (resp->header.pack_id == pack_id))) { switch (resp->done) { case 0: /* request active */ *busy = true; break; case 1: /* request done; response ready to return */ resp->done = 2; /* guard against other readers */ write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return resp; case 2: /* response already being returned */ break; } } } write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return NULL; } /* always adds to end of list */ static Sg_request * sg_add_request(Sg_fd * sfp) { int k; unsigned long iflags; Sg_request *rp = sfp->req_arr; write_lock_irqsave(&sfp->rq_list_lock, iflags); if (!list_empty(&sfp->rq_list)) { if (!sfp->cmd_q) goto out_unlock; for (k = 0; k < SG_MAX_QUEUE; ++k, ++rp) { if (!rp->parentfp) break; } if (k >= SG_MAX_QUEUE) goto out_unlock; } memset(rp, 0, sizeof (Sg_request)); rp->parentfp = sfp; rp->header.duration = jiffies_to_msecs(jiffies); list_add_tail(&rp->entry, &sfp->rq_list); write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return rp; out_unlock: write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return NULL; } /* Return of 1 for found; 0 for not found */ static int sg_remove_request(Sg_fd * sfp, Sg_request * srp) { unsigned long iflags; int res = 0; if (!sfp || !srp || list_empty(&sfp->rq_list)) return res; write_lock_irqsave(&sfp->rq_list_lock, iflags); if (!list_empty(&srp->entry)) { list_del(&srp->entry); srp->parentfp = NULL; res = 1; } write_unlock_irqrestore(&sfp->rq_list_lock, iflags); /* * If the device is detaching, wakeup any readers in case we just * removed the last response, which would leave nothing for them to * return other than -ENODEV. */ if (unlikely(atomic_read(&sfp->parentdp->detaching))) wake_up_interruptible_all(&sfp->read_wait); return res; } static Sg_fd * sg_add_sfp(Sg_device * sdp) { Sg_fd *sfp; unsigned long iflags; int bufflen; sfp = kzalloc(sizeof(*sfp), GFP_ATOMIC | __GFP_NOWARN); if (!sfp) return ERR_PTR(-ENOMEM); init_waitqueue_head(&sfp->read_wait); rwlock_init(&sfp->rq_list_lock); INIT_LIST_HEAD(&sfp->rq_list); kref_init(&sfp->f_ref); mutex_init(&sfp->f_mutex); sfp->timeout = SG_DEFAULT_TIMEOUT; sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; sfp->force_packid = SG_DEF_FORCE_PACK_ID; sfp->cmd_q = SG_DEF_COMMAND_Q; sfp->keep_orphan = SG_DEF_KEEP_ORPHAN; sfp->parentdp = sdp; write_lock_irqsave(&sdp->sfd_lock, iflags); if (atomic_read(&sdp->detaching)) { write_unlock_irqrestore(&sdp->sfd_lock, iflags); kfree(sfp); return ERR_PTR(-ENODEV); } list_add_tail(&sfp->sfd_siblings, &sdp->sfds); write_unlock_irqrestore(&sdp->sfd_lock, iflags); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_add_sfp: sfp=0x%p\n", sfp)); if (unlikely(sg_big_buff != def_reserved_size)) sg_big_buff = def_reserved_size; bufflen = min_t(int, sg_big_buff, max_sectors_bytes(sdp->device->request_queue)); sg_build_reserve(sfp, bufflen); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_add_sfp: bufflen=%d, k_use_sg=%d\n", sfp->reserve.bufflen, sfp->reserve.k_use_sg)); kref_get(&sdp->d_ref); __module_get(THIS_MODULE); return sfp; } static void sg_remove_sfp_usercontext(struct work_struct *work) { struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); struct sg_device *sdp = sfp->parentdp; struct scsi_device *device = sdp->device; Sg_request *srp; unsigned long iflags; /* Cleanup any responses which were never read(). */ write_lock_irqsave(&sfp->rq_list_lock, iflags); while (!list_empty(&sfp->rq_list)) { srp = list_first_entry(&sfp->rq_list, Sg_request, entry); sg_finish_rem_req(srp); list_del(&srp->entry); srp->parentfp = NULL; } write_unlock_irqrestore(&sfp->rq_list_lock, iflags); if (sfp->reserve.bufflen > 0) { SCSI_LOG_TIMEOUT(6, sg_printk(KERN_INFO, sdp, "sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); sg_remove_scat(sfp, &sfp->reserve); } SCSI_LOG_TIMEOUT(6, sg_printk(KERN_INFO, sdp, "sg_remove_sfp: sfp=0x%p\n", sfp)); kfree(sfp); kref_put(&sdp->d_ref, sg_device_destroy); scsi_device_put(device); module_put(THIS_MODULE); } static void sg_remove_sfp(struct kref *kref) { struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); struct sg_device *sdp = sfp->parentdp; unsigned long iflags; write_lock_irqsave(&sdp->sfd_lock, iflags); list_del(&sfp->sfd_siblings); write_unlock_irqrestore(&sdp->sfd_lock, iflags); INIT_WORK(&sfp->ew.work, sg_remove_sfp_usercontext); schedule_work(&sfp->ew.work); } #ifdef CONFIG_SCSI_PROC_FS static int sg_idr_max_id(int id, void *p, void *data) { int *k = data; if (*k < id) *k = id; return 0; } static int sg_last_dev(void) { int k = -1; unsigned long iflags; read_lock_irqsave(&sg_index_lock, iflags); idr_for_each(&sg_index_idr, sg_idr_max_id, &k); read_unlock_irqrestore(&sg_index_lock, iflags); return k + 1; /* origin 1 */ } #endif /* must be called with sg_index_lock held */ static Sg_device *sg_lookup_dev(int dev) { return idr_find(&sg_index_idr, dev); } static Sg_device * sg_get_dev(int dev) { struct sg_device *sdp; unsigned long flags; read_lock_irqsave(&sg_index_lock, flags); sdp = sg_lookup_dev(dev); if (!sdp) sdp = ERR_PTR(-ENXIO); else if (atomic_read(&sdp->detaching)) { /* If sdp->detaching, then the refcount may already be 0, in * which case it would be a bug to do kref_get(). */ sdp = ERR_PTR(-ENODEV); } else kref_get(&sdp->d_ref); read_unlock_irqrestore(&sg_index_lock, flags); return sdp; } #ifdef CONFIG_SCSI_PROC_FS static int sg_proc_seq_show_int(struct seq_file *s, void *v); static int sg_proc_single_open_adio(struct inode *inode, struct file *file); static ssize_t sg_proc_write_adio(struct file *filp, const char __user *buffer, size_t count, loff_t *off); static const struct proc_ops adio_proc_ops = { .proc_open = sg_proc_single_open_adio, .proc_read = seq_read, .proc_lseek = seq_lseek, .proc_write = sg_proc_write_adio, .proc_release = single_release, }; static int sg_proc_single_open_dressz(struct inode *inode, struct file *file); static ssize_t sg_proc_write_dressz(struct file *filp, const char __user *buffer, size_t count, loff_t *off); static const struct proc_ops dressz_proc_ops = { .proc_open = sg_proc_single_open_dressz, .proc_read = seq_read, .proc_lseek = seq_lseek, .proc_write = sg_proc_write_dressz, .proc_release = single_release, }; static int sg_proc_seq_show_version(struct seq_file *s, void *v); static int sg_proc_seq_show_devhdr(struct seq_file *s, void *v); static int sg_proc_seq_show_dev(struct seq_file *s, void *v); static void * dev_seq_start(struct seq_file *s, loff_t *pos); static void * dev_seq_next(struct seq_file *s, void *v, loff_t *pos); static void dev_seq_stop(struct seq_file *s, void *v); static const struct seq_operations dev_seq_ops = { .start = dev_seq_start, .next = dev_seq_next, .stop = dev_seq_stop, .show = sg_proc_seq_show_dev, }; static int sg_proc_seq_show_devstrs(struct seq_file *s, void *v); static const struct seq_operations devstrs_seq_ops = { .start = dev_seq_start, .next = dev_seq_next, .stop = dev_seq_stop, .show = sg_proc_seq_show_devstrs, }; static int sg_proc_seq_show_debug(struct seq_file *s, void *v); static const struct seq_operations debug_seq_ops = { .start = dev_seq_start, .next = dev_seq_next, .stop = dev_seq_stop, .show = sg_proc_seq_show_debug, }; static int sg_proc_init(void) { struct proc_dir_entry *p; p = proc_mkdir("scsi/sg", NULL); if (!p) return 1; proc_create("allow_dio", S_IRUGO | S_IWUSR, p, &adio_proc_ops); proc_create_seq("debug", S_IRUGO, p, &debug_seq_ops); proc_create("def_reserved_size", S_IRUGO | S_IWUSR, p, &dressz_proc_ops); proc_create_single("device_hdr", S_IRUGO, p, sg_proc_seq_show_devhdr); proc_create_seq("devices", S_IRUGO, p, &dev_seq_ops); proc_create_seq("device_strs", S_IRUGO, p, &devstrs_seq_ops); proc_create_single("version", S_IRUGO, p, sg_proc_seq_show_version); return 0; } static int sg_proc_seq_show_int(struct seq_file *s, void *v) { seq_printf(s, "%d\n", *((int *)s->private)); return 0; } static int sg_proc_single_open_adio(struct inode *inode, struct file *file) { return single_open(file, sg_proc_seq_show_int, &sg_allow_dio); } static ssize_t sg_proc_write_adio(struct file *filp, const char __user *buffer, size_t count, loff_t *off) { int err; unsigned long num; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; err = kstrtoul_from_user(buffer, count, 0, &num); if (err) return err; sg_allow_dio = num ? 1 : 0; return count; } static int sg_proc_single_open_dressz(struct inode *inode, struct file *file) { return single_open(file, sg_proc_seq_show_int, &sg_big_buff); } static ssize_t sg_proc_write_dressz(struct file *filp, const char __user *buffer, size_t count, loff_t *off) { int err; unsigned long k = ULONG_MAX; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; err = kstrtoul_from_user(buffer, count, 0, &k); if (err) return err; if (k <= 1048576) { /* limit "big buff" to 1 MB */ sg_big_buff = k; return count; } return -ERANGE; } static int sg_proc_seq_show_version(struct seq_file *s, void *v) { seq_printf(s, "%d\t%s [%s]\n", sg_version_num, SG_VERSION_STR, sg_version_date); return 0; } static int sg_proc_seq_show_devhdr(struct seq_file *s, void *v) { seq_puts(s, "host\tchan\tid\tlun\ttype\topens\tqdepth\tbusy\tonline\n"); return 0; } struct sg_proc_deviter { loff_t index; size_t max; }; static void * dev_seq_start(struct seq_file *s, loff_t *pos) { struct sg_proc_deviter * it = kmalloc(sizeof(*it), GFP_KERNEL); s->private = it; if (! it) return NULL; it->index = *pos; it->max = sg_last_dev(); if (it->index >= it->max) return NULL; return it; } static void * dev_seq_next(struct seq_file *s, void *v, loff_t *pos) { struct sg_proc_deviter * it = s->private; *pos = ++it->index; return (it->index < it->max) ? it : NULL; } static void dev_seq_stop(struct seq_file *s, void *v) { kfree(s->private); } static int sg_proc_seq_show_dev(struct seq_file *s, void *v) { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; unsigned long iflags; read_lock_irqsave(&sg_index_lock, iflags); sdp = it ? sg_lookup_dev(it->index) : NULL; if ((NULL == sdp) || (NULL == sdp->device) || (atomic_read(&sdp->detaching))) seq_puts(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); else { scsidp = sdp->device; seq_printf(s, "%d\t%d\t%d\t%llu\t%d\t%d\t%d\t%d\t%d\n", scsidp->host->host_no, scsidp->channel, scsidp->id, scsidp->lun, (int) scsidp->type, 1, (int) scsidp->queue_depth, (int) scsi_device_busy(scsidp), (int) scsi_device_online(scsidp)); } read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } static int sg_proc_seq_show_devstrs(struct seq_file *s, void *v) { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; unsigned long iflags; read_lock_irqsave(&sg_index_lock, iflags); sdp = it ? sg_lookup_dev(it->index) : NULL; scsidp = sdp ? sdp->device : NULL; if (sdp && scsidp && (!atomic_read(&sdp->detaching))) seq_printf(s, "%8.8s\t%16.16s\t%4.4s\n", scsidp->vendor, scsidp->model, scsidp->rev); else seq_puts(s, "<no active device>\n"); read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } /* must be called while holding sg_index_lock */ static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp) { int k, new_interface, blen, usg; Sg_request *srp; Sg_fd *fp; const sg_io_hdr_t *hp; const char * cp; unsigned int ms; k = 0; list_for_each_entry(fp, &sdp->sfds, sfd_siblings) { k++; read_lock(&fp->rq_list_lock); /* irqs already disabled */ seq_printf(s, " FD(%d): timeout=%dms bufflen=%d " "(res)sgat=%d low_dma=%d\n", k, jiffies_to_msecs(fp->timeout), fp->reserve.bufflen, (int) fp->reserve.k_use_sg, 0); seq_printf(s, " cmd_q=%d f_packid=%d k_orphan=%d closed=0\n", (int) fp->cmd_q, (int) fp->force_packid, (int) fp->keep_orphan); list_for_each_entry(srp, &fp->rq_list, entry) { hp = &srp->header; new_interface = (hp->interface_id == '\0') ? 0 : 1; if (srp->res_used) { if (new_interface && (SG_FLAG_MMAP_IO & hp->flags)) cp = " mmap>> "; else cp = " rb>> "; } else { if (SG_INFO_DIRECT_IO_MASK & hp->info) cp = " dio>> "; else cp = " "; } seq_puts(s, cp); blen = srp->data.bufflen; usg = srp->data.k_use_sg; seq_puts(s, srp->done ? ((1 == srp->done) ? "rcv:" : "fin:") : "act:"); seq_printf(s, " id=%d blen=%d", srp->header.pack_id, blen); if (srp->done) seq_printf(s, " dur=%d", hp->duration); else { ms = jiffies_to_msecs(jiffies); seq_printf(s, " t_o/elap=%d/%d", (new_interface ? hp->timeout : jiffies_to_msecs(fp->timeout)), (ms > hp->duration ? ms - hp->duration : 0)); } seq_printf(s, "ms sgat=%d op=0x%02x\n", usg, (int) srp->data.cmd_opcode); } if (list_empty(&fp->rq_list)) seq_puts(s, " No requests active\n"); read_unlock(&fp->rq_list_lock); } } static int sg_proc_seq_show_debug(struct seq_file *s, void *v) { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; unsigned long iflags; if (it && (0 == it->index)) seq_printf(s, "max_active_device=%d def_reserved_size=%d\n", (int)it->max, sg_big_buff); read_lock_irqsave(&sg_index_lock, iflags); sdp = it ? sg_lookup_dev(it->index) : NULL; if (NULL == sdp) goto skip; read_lock(&sdp->sfd_lock); if (!list_empty(&sdp->sfds)) { seq_printf(s, " >>> device=%s ", sdp->name); if (atomic_read(&sdp->detaching)) seq_puts(s, "detaching pending close "); else if (sdp->device) { struct scsi_device *scsidp = sdp->device; seq_printf(s, "%d:%d:%d:%llu em=%d", scsidp->host->host_no, scsidp->channel, scsidp->id, scsidp->lun, scsidp->host->hostt->emulated); } seq_printf(s, " sg_tablesize=%d excl=%d open_cnt=%d\n", sdp->sg_tablesize, sdp->exclude, sdp->open_cnt); sg_proc_debug_helper(s, sdp); } read_unlock(&sdp->sfd_lock); skip: read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } #endif /* CONFIG_SCSI_PROC_FS */ module_init(init_sg); module_exit(exit_sg); |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Shared Memory Communications over RDMA (SMC-R) and RoCE * * Definitions for LLC (link layer control) message handling * * Copyright IBM Corp. 2016 * * Author(s): Klaus Wacker <Klaus.Wacker@de.ibm.com> * Ursula Braun <ubraun@linux.vnet.ibm.com> */ #ifndef SMC_LLC_H #define SMC_LLC_H #include "smc_wr.h" #define SMC_LLC_FLAG_RESP 0x80 #define SMC_LLC_WAIT_FIRST_TIME (5 * HZ) #define SMC_LLC_WAIT_TIME (2 * HZ) #define SMC_LLC_TESTLINK_DEFAULT_TIME (30 * HZ) enum smc_llc_reqresp { SMC_LLC_REQ, SMC_LLC_RESP }; enum smc_llc_msg_type { SMC_LLC_CONFIRM_LINK = 0x01, SMC_LLC_ADD_LINK = 0x02, SMC_LLC_ADD_LINK_CONT = 0x03, SMC_LLC_DELETE_LINK = 0x04, SMC_LLC_REQ_ADD_LINK = 0x05, SMC_LLC_CONFIRM_RKEY = 0x06, SMC_LLC_TEST_LINK = 0x07, SMC_LLC_CONFIRM_RKEY_CONT = 0x08, SMC_LLC_DELETE_RKEY = 0x09, /* V2 types */ SMC_LLC_CONFIRM_LINK_V2 = 0x21, SMC_LLC_ADD_LINK_V2 = 0x22, SMC_LLC_DELETE_LINK_V2 = 0x24, SMC_LLC_REQ_ADD_LINK_V2 = 0x25, SMC_LLC_CONFIRM_RKEY_V2 = 0x26, SMC_LLC_TEST_LINK_V2 = 0x27, SMC_LLC_DELETE_RKEY_V2 = 0x29, }; #define smc_link_downing(state) \ (cmpxchg(state, SMC_LNK_ACTIVE, SMC_LNK_INACTIVE) == SMC_LNK_ACTIVE) /* LLC DELETE LINK Request Reason Codes */ #define SMC_LLC_DEL_LOST_PATH 0x00010000 #define SMC_LLC_DEL_OP_INIT_TERM 0x00020000 #define SMC_LLC_DEL_PROG_INIT_TERM 0x00030000 #define SMC_LLC_DEL_PROT_VIOL 0x00040000 #define SMC_LLC_DEL_NO_ASYM_NEEDED 0x00050000 /* LLC DELETE LINK Response Reason Codes */ #define SMC_LLC_DEL_NOLNK 0x00100000 /* Unknown Link ID (no link) */ #define SMC_LLC_DEL_NOLGR 0x00200000 /* Unknown Link Group */ /* returns a usable link of the link group, or NULL */ static inline struct smc_link *smc_llc_usable_link(struct smc_link_group *lgr) { int i; for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) if (smc_link_usable(&lgr->lnk[i])) return &lgr->lnk[i]; return NULL; } /* set the termination reason code for the link group */ static inline void smc_llc_set_termination_rsn(struct smc_link_group *lgr, u32 rsn) { if (!lgr->llc_termination_rsn) lgr->llc_termination_rsn = rsn; } /* transmit */ int smc_llc_send_confirm_link(struct smc_link *lnk, enum smc_llc_reqresp reqresp); int smc_llc_send_add_link(struct smc_link *link, u8 mac[], u8 gid[], struct smc_link *link_new, enum smc_llc_reqresp reqresp); int smc_llc_send_delete_link(struct smc_link *link, u8 link_del_id, enum smc_llc_reqresp reqresp, bool orderly, u32 reason); void smc_llc_srv_delete_link_local(struct smc_link *link, u8 del_link_id); void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc); void smc_llc_lgr_clear(struct smc_link_group *lgr); int smc_llc_link_init(struct smc_link *link); void smc_llc_link_active(struct smc_link *link); void smc_llc_link_clear(struct smc_link *link, bool log); int smc_llc_do_confirm_rkey(struct smc_link *send_link, struct smc_buf_desc *rmb_desc); int smc_llc_do_delete_rkey(struct smc_link_group *lgr, struct smc_buf_desc *rmb_desc); int smc_llc_flow_initiate(struct smc_link_group *lgr, enum smc_llc_flowtype type); void smc_llc_flow_stop(struct smc_link_group *lgr, struct smc_llc_flow *flow); int smc_llc_eval_conf_link(struct smc_llc_qentry *qentry, enum smc_llc_reqresp type); void smc_llc_link_set_uid(struct smc_link *link); void smc_llc_save_peer_uid(struct smc_llc_qentry *qentry); struct smc_llc_qentry *smc_llc_wait(struct smc_link_group *lgr, struct smc_link *lnk, int time_out, u8 exp_msg); struct smc_llc_qentry *smc_llc_flow_qentry_clr(struct smc_llc_flow *flow); void smc_llc_flow_qentry_del(struct smc_llc_flow *flow); void smc_llc_send_link_delete_all(struct smc_link_group *lgr, bool ord, u32 rsn); int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry); int smc_llc_srv_add_link(struct smc_link *link, struct smc_llc_qentry *req_qentry); void smc_llc_add_link_local(struct smc_link *link); int smc_llc_init(void) __init; #endif /* SMC_LLC_H */ |
| 64 64 1126 55 64 1568 1627 1159 16 1286 1172 1173 785 788 787 787 488 1 788 493 490 490 788 787 788 788 788 784 8 44 2 35 9 490 254 746 42 297 490 787 788 788 787 788 787 440 779 782 196 586 779 613 678 1157 1073 84 3 82 67 10 1 1 1 26 1123 1156 1722 4 1720 145 145 160 2 2 1 146 1 7 147 1 1 148 1 146 1708 1706 5 35 33 2 1122 1111 5 3 14 589 577 1 1 1 1 7 1 6 485 5 479 608 1 2 10 10 590 586 5 588 361 27 83 67 49 272 316 29 561 5 548 35 1 247 237 1 454 485 4 480 489 3 984 986 628 629 1005 670 124 214 2 998 2 1 1 1 12 881 96 1 105 101 100 882 788 195 13 25 596 100 533 626 1 1 69 1 8 60 1 46 1 6 45 17 1 1 6 1 1 1 1 1 2 2 4 4 1 2 2 145 2 2 2 1 138 1 137 1 10 1 3 6 5 48 3 15 19 1 10 1 2 1 1 1 1 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 | // SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2017 Facebook */ #include <linux/bpf.h> #include <linux/btf.h> #include <linux/btf_ids.h> #include <linux/slab.h> #include <linux/init.h> #include <linux/vmalloc.h> #include <linux/etherdevice.h> #include <linux/filter.h> #include <linux/rcupdate_trace.h> #include <linux/sched/signal.h> #include <net/bpf_sk_storage.h> #include <net/hotdata.h> #include <net/sock.h> #include <net/tcp.h> #include <net/net_namespace.h> #include <net/page_pool/helpers.h> #include <linux/error-injection.h> #include <linux/smp.h> #include <linux/sock_diag.h> #include <linux/netfilter.h> #include <net/netdev_rx_queue.h> #include <net/xdp.h> #include <net/netfilter/nf_bpf_link.h> #define CREATE_TRACE_POINTS #include <trace/events/bpf_test_run.h> struct bpf_test_timer { enum { NO_PREEMPT, NO_MIGRATE } mode; u32 i; u64 time_start, time_spent; }; static void bpf_test_timer_enter(struct bpf_test_timer *t) __acquires(rcu) { rcu_read_lock(); if (t->mode == NO_PREEMPT) preempt_disable(); else migrate_disable(); t->time_start = ktime_get_ns(); } static void bpf_test_timer_leave(struct bpf_test_timer *t) __releases(rcu) { t->time_start = 0; if (t->mode == NO_PREEMPT) preempt_enable(); else migrate_enable(); rcu_read_unlock(); } static bool bpf_test_timer_continue(struct bpf_test_timer *t, int iterations, u32 repeat, int *err, u32 *duration) __must_hold(rcu) { t->i += iterations; if (t->i >= repeat) { /* We're done. */ t->time_spent += ktime_get_ns() - t->time_start; do_div(t->time_spent, t->i); *duration = t->time_spent > U32_MAX ? U32_MAX : (u32)t->time_spent; *err = 0; goto reset; } if (signal_pending(current)) { /* During iteration: we've been cancelled, abort. */ *err = -EINTR; goto reset; } if (need_resched()) { /* During iteration: we need to reschedule between runs. */ t->time_spent += ktime_get_ns() - t->time_start; bpf_test_timer_leave(t); cond_resched(); bpf_test_timer_enter(t); } /* Do another round. */ return true; reset: t->i = 0; return false; } /* We put this struct at the head of each page with a context and frame * initialised when the page is allocated, so we don't have to do this on each * repetition of the test run. */ struct xdp_page_head { struct xdp_buff orig_ctx; struct xdp_buff ctx; union { /* ::data_hard_start starts here */ DECLARE_FLEX_ARRAY(struct xdp_frame, frame); DECLARE_FLEX_ARRAY(u8, data); }; }; struct xdp_test_data { struct xdp_buff *orig_ctx; struct xdp_rxq_info rxq; struct net_device *dev; struct page_pool *pp; struct xdp_frame **frames; struct sk_buff **skbs; struct xdp_mem_info mem; u32 batch_size; u32 frame_cnt; }; /* tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c:%MAX_PKT_SIZE * must be updated accordingly this gets changed, otherwise BPF selftests * will fail. */ #define TEST_XDP_FRAME_SIZE (PAGE_SIZE - sizeof(struct xdp_page_head)) #define TEST_XDP_MAX_BATCH 256 static void xdp_test_run_init_page(netmem_ref netmem, void *arg) { struct xdp_page_head *head = phys_to_virt(page_to_phys(netmem_to_page(netmem))); struct xdp_buff *new_ctx, *orig_ctx; u32 headroom = XDP_PACKET_HEADROOM; struct xdp_test_data *xdp = arg; size_t frm_len, meta_len; struct xdp_frame *frm; void *data; orig_ctx = xdp->orig_ctx; frm_len = orig_ctx->data_end - orig_ctx->data_meta; meta_len = orig_ctx->data - orig_ctx->data_meta; headroom -= meta_len; new_ctx = &head->ctx; frm = head->frame; data = head->data; memcpy(data + headroom, orig_ctx->data_meta, frm_len); xdp_init_buff(new_ctx, TEST_XDP_FRAME_SIZE, &xdp->rxq); xdp_prepare_buff(new_ctx, data, headroom, frm_len, true); new_ctx->data = new_ctx->data_meta + meta_len; xdp_update_frame_from_buff(new_ctx, frm); frm->mem_type = new_ctx->rxq->mem.type; memcpy(&head->orig_ctx, new_ctx, sizeof(head->orig_ctx)); } static int xdp_test_run_setup(struct xdp_test_data *xdp, struct xdp_buff *orig_ctx) { struct page_pool *pp; int err = -ENOMEM; struct page_pool_params pp_params = { .order = 0, .flags = 0, .pool_size = xdp->batch_size, .nid = NUMA_NO_NODE, .init_callback = xdp_test_run_init_page, .init_arg = xdp, }; xdp->frames = kvmalloc_array(xdp->batch_size, sizeof(void *), GFP_KERNEL); if (!xdp->frames) return -ENOMEM; xdp->skbs = kvmalloc_array(xdp->batch_size, sizeof(void *), GFP_KERNEL); if (!xdp->skbs) goto err_skbs; pp = page_pool_create(&pp_params); if (IS_ERR(pp)) { err = PTR_ERR(pp); goto err_pp; } /* will copy 'mem.id' into pp->xdp_mem_id */ err = xdp_reg_mem_model(&xdp->mem, MEM_TYPE_PAGE_POOL, pp); if (err) goto err_mmodel; xdp->pp = pp; /* We create a 'fake' RXQ referencing the original dev, but with an * xdp_mem_info pointing to our page_pool */ xdp_rxq_info_reg(&xdp->rxq, orig_ctx->rxq->dev, 0, 0); xdp->rxq.mem.type = MEM_TYPE_PAGE_POOL; xdp->rxq.mem.id = pp->xdp_mem_id; xdp->dev = orig_ctx->rxq->dev; xdp->orig_ctx = orig_ctx; return 0; err_mmodel: page_pool_destroy(pp); err_pp: kvfree(xdp->skbs); err_skbs: kvfree(xdp->frames); return err; } static void xdp_test_run_teardown(struct xdp_test_data *xdp) { xdp_unreg_mem_model(&xdp->mem); page_pool_destroy(xdp->pp); kfree(xdp->frames); kfree(xdp->skbs); } static bool frame_was_changed(const struct xdp_page_head *head) { /* xdp_scrub_frame() zeroes the data pointer, flags is the last field, * i.e. has the highest chances to be overwritten. If those two are * untouched, it's most likely safe to skip the context reset. */ return head->frame->data != head->orig_ctx.data || head->frame->flags != head->orig_ctx.flags; } static bool ctx_was_changed(struct xdp_page_head *head) { return head->orig_ctx.data != head->ctx.data || head->orig_ctx.data_meta != head->ctx.data_meta || head->orig_ctx.data_end != head->ctx.data_end; } static void reset_ctx(struct xdp_page_head *head) { if (likely(!frame_was_changed(head) && !ctx_was_changed(head))) return; head->ctx.data = head->orig_ctx.data; head->ctx.data_meta = head->orig_ctx.data_meta; head->ctx.data_end = head->orig_ctx.data_end; xdp_update_frame_from_buff(&head->ctx, head->frame); head->frame->mem_type = head->orig_ctx.rxq->mem.type; } static int xdp_recv_frames(struct xdp_frame **frames, int nframes, struct sk_buff **skbs, struct net_device *dev) { gfp_t gfp = __GFP_ZERO | GFP_ATOMIC; int i, n; LIST_HEAD(list); n = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache, gfp, nframes, (void **)skbs); if (unlikely(n == 0)) { for (i = 0; i < nframes; i++) xdp_return_frame(frames[i]); return -ENOMEM; } for (i = 0; i < nframes; i++) { struct xdp_frame *xdpf = frames[i]; struct sk_buff *skb = skbs[i]; skb = __xdp_build_skb_from_frame(xdpf, skb, dev); if (!skb) { xdp_return_frame(xdpf); continue; } list_add_tail(&skb->list, &list); } netif_receive_skb_list(&list); return 0; } static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog, u32 repeat) { struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx; int err = 0, act, ret, i, nframes = 0, batch_sz; struct xdp_frame **frames = xdp->frames; struct bpf_redirect_info *ri; struct xdp_page_head *head; struct xdp_frame *frm; bool redirect = false; struct xdp_buff *ctx; struct page *page; batch_sz = min_t(u32, repeat, xdp->batch_size); local_bh_disable(); bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx); ri = bpf_net_ctx_get_ri(); xdp_set_return_frame_no_direct(); for (i = 0; i < batch_sz; i++) { page = page_pool_dev_alloc_pages(xdp->pp); if (!page) { err = -ENOMEM; goto out; } head = phys_to_virt(page_to_phys(page)); reset_ctx(head); ctx = &head->ctx; frm = head->frame; xdp->frame_cnt++; act = bpf_prog_run_xdp(prog, ctx); /* if program changed pkt bounds we need to update the xdp_frame */ if (unlikely(ctx_was_changed(head))) { ret = xdp_update_frame_from_buff(ctx, frm); if (ret) { xdp_return_buff(ctx); continue; } } switch (act) { case XDP_TX: /* we can't do a real XDP_TX since we're not in the * driver, so turn it into a REDIRECT back to the same * index */ ri->tgt_index = xdp->dev->ifindex; ri->map_id = INT_MAX; ri->map_type = BPF_MAP_TYPE_UNSPEC; fallthrough; case XDP_REDIRECT: redirect = true; ret = xdp_do_redirect_frame(xdp->dev, ctx, frm, prog); if (ret) xdp_return_buff(ctx); break; case XDP_PASS: frames[nframes++] = frm; break; default: bpf_warn_invalid_xdp_action(NULL, prog, act); fallthrough; case XDP_DROP: xdp_return_buff(ctx); break; } } out: if (redirect) xdp_do_flush(); if (nframes) { ret = xdp_recv_frames(frames, nframes, xdp->skbs, xdp->dev); if (ret) err = ret; } xdp_clear_return_frame_no_direct(); bpf_net_ctx_clear(bpf_net_ctx); local_bh_enable(); return err; } static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx, u32 repeat, u32 batch_size, u32 *time) { struct xdp_test_data xdp = { .batch_size = batch_size }; struct bpf_test_timer t = { .mode = NO_MIGRATE }; int ret; if (!repeat) repeat = 1; ret = xdp_test_run_setup(&xdp, ctx); if (ret) return ret; bpf_test_timer_enter(&t); do { xdp.frame_cnt = 0; ret = xdp_test_run_batch(&xdp, prog, repeat - t.i); if (unlikely(ret < 0)) break; } while (bpf_test_timer_continue(&t, xdp.frame_cnt, repeat, &ret, time)); bpf_test_timer_leave(&t); xdp_test_run_teardown(&xdp); return ret; } static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *retval, u32 *time, bool xdp) { struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx; struct bpf_prog_array_item item = {.prog = prog}; struct bpf_run_ctx *old_ctx; struct bpf_cg_run_ctx run_ctx; struct bpf_test_timer t = { NO_MIGRATE }; enum bpf_cgroup_storage_type stype; int ret; for_each_cgroup_storage_type(stype) { item.cgroup_storage[stype] = bpf_cgroup_storage_alloc(prog, stype); if (IS_ERR(item.cgroup_storage[stype])) { item.cgroup_storage[stype] = NULL; for_each_cgroup_storage_type(stype) bpf_cgroup_storage_free(item.cgroup_storage[stype]); return -ENOMEM; } } if (!repeat) repeat = 1; bpf_test_timer_enter(&t); old_ctx = bpf_set_run_ctx(&run_ctx.run_ctx); do { run_ctx.prog_item = &item; local_bh_disable(); bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx); if (xdp) *retval = bpf_prog_run_xdp(prog, ctx); else *retval = bpf_prog_run(prog, ctx); bpf_net_ctx_clear(bpf_net_ctx); local_bh_enable(); } while (bpf_test_timer_continue(&t, 1, repeat, &ret, time)); bpf_reset_run_ctx(old_ctx); bpf_test_timer_leave(&t); for_each_cgroup_storage_type(stype) bpf_cgroup_storage_free(item.cgroup_storage[stype]); return ret; } static int bpf_test_finish(const union bpf_attr *kattr, union bpf_attr __user *uattr, const void *data, struct skb_shared_info *sinfo, u32 size, u32 retval, u32 duration) { void __user *data_out = u64_to_user_ptr(kattr->test.data_out); int err = -EFAULT; u32 copy_size = size; /* Clamp copy if the user has provided a size hint, but copy the full * buffer if not to retain old behaviour. */ if (kattr->test.data_size_out && copy_size > kattr->test.data_size_out) { copy_size = kattr->test.data_size_out; err = -ENOSPC; } if (data_out) { int len = sinfo ? copy_size - sinfo->xdp_frags_size : copy_size; if (len < 0) { err = -ENOSPC; goto out; } if (copy_to_user(data_out, data, len)) goto out; if (sinfo) { int i, offset = len; u32 data_len; for (i = 0; i < sinfo->nr_frags; i++) { skb_frag_t *frag = &sinfo->frags[i]; if (offset >= copy_size) { err = -ENOSPC; break; } data_len = min_t(u32, copy_size - offset, skb_frag_size(frag)); if (copy_to_user(data_out + offset, skb_frag_address(frag), data_len)) goto out; offset += data_len; } } } if (copy_to_user(&uattr->test.data_size_out, &size, sizeof(size))) goto out; if (copy_to_user(&uattr->test.retval, &retval, sizeof(retval))) goto out; if (copy_to_user(&uattr->test.duration, &duration, sizeof(duration))) goto out; if (err != -ENOSPC) err = 0; out: trace_bpf_test_finish(&err); return err; } /* Integer types of various sizes and pointer combinations cover variety of * architecture dependent calling conventions. 7+ can be supported in the * future. */ __bpf_kfunc_start_defs(); __bpf_kfunc int bpf_fentry_test1(int a) { return a + 1; } EXPORT_SYMBOL_GPL(bpf_fentry_test1); int noinline bpf_fentry_test2(int a, u64 b) { return a + b; } int noinline bpf_fentry_test3(char a, int b, u64 c) { return a + b + c; } int noinline bpf_fentry_test4(void *a, char b, int c, u64 d) { return (long)a + b + c + d; } int noinline bpf_fentry_test5(u64 a, void *b, short c, int d, u64 e) { return a + (long)b + c + d + e; } int noinline bpf_fentry_test6(u64 a, void *b, short c, int d, void *e, u64 f) { return a + (long)b + c + d + (long)e + f; } struct bpf_fentry_test_t { struct bpf_fentry_test_t *a; }; int noinline bpf_fentry_test7(struct bpf_fentry_test_t *arg) { asm volatile ("": "+r"(arg)); return (long)arg; } int noinline bpf_fentry_test8(struct bpf_fentry_test_t *arg) { return (long)arg->a; } __bpf_kfunc u32 bpf_fentry_test9(u32 *a) { return *a; } int noinline bpf_fentry_test10(const void *a) { return (long)a; } void noinline bpf_fentry_test_sinfo(struct skb_shared_info *sinfo) { } __bpf_kfunc int bpf_modify_return_test(int a, int *b) { *b += 1; return a + *b; } __bpf_kfunc int bpf_modify_return_test2(int a, int *b, short c, int d, void *e, char f, int g) { *b += 1; return a + *b + c + d + (long)e + f + g; } __bpf_kfunc int bpf_modify_return_test_tp(int nonce) { trace_bpf_trigger_tp(nonce); return nonce; } int noinline bpf_fentry_shadow_test(int a) { return a + 1; } struct prog_test_member1 { int a; }; struct prog_test_member { struct prog_test_member1 m; int c; }; struct prog_test_ref_kfunc { int a; int b; struct prog_test_member memb; struct prog_test_ref_kfunc *next; refcount_t cnt; }; __bpf_kfunc void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) { refcount_dec(&p->cnt); } __bpf_kfunc void bpf_kfunc_call_test_release_dtor(void *p) { bpf_kfunc_call_test_release(p); } CFI_NOSEAL(bpf_kfunc_call_test_release_dtor); __bpf_kfunc void bpf_kfunc_call_memb_release(struct prog_test_member *p) { } __bpf_kfunc void bpf_kfunc_call_memb_release_dtor(void *p) { } CFI_NOSEAL(bpf_kfunc_call_memb_release_dtor); __bpf_kfunc_end_defs(); BTF_KFUNCS_START(bpf_test_modify_return_ids) BTF_ID_FLAGS(func, bpf_modify_return_test) BTF_ID_FLAGS(func, bpf_modify_return_test2) BTF_ID_FLAGS(func, bpf_modify_return_test_tp) BTF_ID_FLAGS(func, bpf_fentry_test1, KF_SLEEPABLE) BTF_KFUNCS_END(bpf_test_modify_return_ids) static const struct btf_kfunc_id_set bpf_test_modify_return_set = { .owner = THIS_MODULE, .set = &bpf_test_modify_return_ids, }; BTF_KFUNCS_START(test_sk_check_kfunc_ids) BTF_ID_FLAGS(func, bpf_kfunc_call_test_release, KF_RELEASE) BTF_ID_FLAGS(func, bpf_kfunc_call_memb_release, KF_RELEASE) BTF_KFUNCS_END(test_sk_check_kfunc_ids) static void *bpf_test_init(const union bpf_attr *kattr, u32 user_size, u32 size, u32 headroom, u32 tailroom) { void __user *data_in = u64_to_user_ptr(kattr->test.data_in); void *data; if (user_size < ETH_HLEN || user_size > PAGE_SIZE - headroom - tailroom) return ERR_PTR(-EINVAL); size = SKB_DATA_ALIGN(size); data = kzalloc(size + headroom + tailroom, GFP_USER); if (!data) return ERR_PTR(-ENOMEM); if (copy_from_user(data + headroom, data_in, user_size)) { kfree(data); return ERR_PTR(-EFAULT); } return data; } int bpf_prog_test_run_tracing(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { struct bpf_fentry_test_t arg = {}; u16 side_effect = 0, ret = 0; int b = 2, err = -EFAULT; u32 retval = 0; if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size) return -EINVAL; switch (prog->expected_attach_type) { case BPF_TRACE_FENTRY: case BPF_TRACE_FEXIT: if (bpf_fentry_test1(1) != 2 || bpf_fentry_test2(2, 3) != 5 || bpf_fentry_test3(4, 5, 6) != 15 || bpf_fentry_test4((void *)7, 8, 9, 10) != 34 || bpf_fentry_test5(11, (void *)12, 13, 14, 15) != 65 || bpf_fentry_test6(16, (void *)17, 18, 19, (void *)20, 21) != 111 || bpf_fentry_test7((struct bpf_fentry_test_t *)0) != 0 || bpf_fentry_test8(&arg) != 0 || bpf_fentry_test9(&retval) != 0 || bpf_fentry_test10((void *)0) != 0) goto out; break; case BPF_MODIFY_RETURN: ret = bpf_modify_return_test(1, &b); if (b != 2) side_effect++; b = 2; ret += bpf_modify_return_test2(1, &b, 3, 4, (void *)5, 6, 7); if (b != 2) side_effect++; break; default: goto out; } retval = ((u32)side_effect << 16) | ret; if (copy_to_user(&uattr->test.retval, &retval, sizeof(retval))) goto out; err = 0; out: trace_bpf_test_finish(&err); return err; } struct bpf_raw_tp_test_run_info { struct bpf_prog *prog; void *ctx; u32 retval; }; static void __bpf_prog_test_run_raw_tp(void *data) { struct bpf_raw_tp_test_run_info *info = data; struct bpf_trace_run_ctx run_ctx = {}; struct bpf_run_ctx *old_run_ctx; old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx); rcu_read_lock(); info->retval = bpf_prog_run(info->prog, info->ctx); rcu_read_unlock(); bpf_reset_run_ctx(old_run_ctx); } int bpf_prog_test_run_raw_tp(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { void __user *ctx_in = u64_to_user_ptr(kattr->test.ctx_in); __u32 ctx_size_in = kattr->test.ctx_size_in; struct bpf_raw_tp_test_run_info info; int cpu = kattr->test.cpu, err = 0; int current_cpu; /* doesn't support data_in/out, ctx_out, duration, or repeat */ if (kattr->test.data_in || kattr->test.data_out || kattr->test.ctx_out || kattr->test.duration || kattr->test.repeat || kattr->test.batch_size) return -EINVAL; if (ctx_size_in < prog->aux->max_ctx_offset || ctx_size_in > MAX_BPF_FUNC_ARGS * sizeof(u64)) return -EINVAL; if ((kattr->test.flags & BPF_F_TEST_RUN_ON_CPU) == 0 && cpu != 0) return -EINVAL; if (ctx_size_in) { info.ctx = memdup_user(ctx_in, ctx_size_in); if (IS_ERR(info.ctx)) return PTR_ERR(info.ctx); } else { info.ctx = NULL; } info.prog = prog; current_cpu = get_cpu(); if ((kattr->test.flags & BPF_F_TEST_RUN_ON_CPU) == 0 || cpu == current_cpu) { __bpf_prog_test_run_raw_tp(&info); } else if (cpu >= nr_cpu_ids || !cpu_online(cpu)) { /* smp_call_function_single() also checks cpu_online() * after csd_lock(). However, since cpu is from user * space, let's do an extra quick check to filter out * invalid value before smp_call_function_single(). */ err = -ENXIO; } else { err = smp_call_function_single(cpu, __bpf_prog_test_run_raw_tp, &info, 1); } put_cpu(); if (!err && copy_to_user(&uattr->test.retval, &info.retval, sizeof(u32))) err = -EFAULT; kfree(info.ctx); return err; } static void *bpf_ctx_init(const union bpf_attr *kattr, u32 max_size) { void __user *data_in = u64_to_user_ptr(kattr->test.ctx_in); void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out); u32 size = kattr->test.ctx_size_in; void *data; int err; if (!data_in && !data_out) return NULL; data = kzalloc(max_size, GFP_USER); if (!data) return ERR_PTR(-ENOMEM); if (data_in) { err = bpf_check_uarg_tail_zero(USER_BPFPTR(data_in), max_size, size); if (err) { kfree(data); return ERR_PTR(err); } size = min_t(u32, max_size, size); if (copy_from_user(data, data_in, size)) { kfree(data); return ERR_PTR(-EFAULT); } } return data; } static int bpf_ctx_finish(const union bpf_attr *kattr, union bpf_attr __user *uattr, const void *data, u32 size) { void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out); int err = -EFAULT; u32 copy_size = size; if (!data || !data_out) return 0; if (copy_size > kattr->test.ctx_size_out) { copy_size = kattr->test.ctx_size_out; err = -ENOSPC; } if (copy_to_user(data_out, data, copy_size)) goto out; if (copy_to_user(&uattr->test.ctx_size_out, &size, sizeof(size))) goto out; if (err != -ENOSPC) err = 0; out: return err; } /** * range_is_zero - test whether buffer is initialized * @buf: buffer to check * @from: check from this position * @to: check up until (excluding) this position * * This function returns true if the there is a non-zero byte * in the buf in the range [from,to). */ static inline bool range_is_zero(void *buf, size_t from, size_t to) { return !memchr_inv((u8 *)buf + from, 0, to - from); } static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb) { struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb; if (!__skb) return 0; /* make sure the fields we don't use are zeroed */ if (!range_is_zero(__skb, 0, offsetof(struct __sk_buff, mark))) return -EINVAL; /* mark is allowed */ if (!range_is_zero(__skb, offsetofend(struct __sk_buff, mark), offsetof(struct __sk_buff, priority))) return -EINVAL; /* priority is allowed */ /* ingress_ifindex is allowed */ /* ifindex is allowed */ if (!range_is_zero(__skb, offsetofend(struct __sk_buff, ifindex), offsetof(struct __sk_buff, cb))) return -EINVAL; /* cb is allowed */ if (!range_is_zero(__skb, offsetofend(struct __sk_buff, cb), offsetof(struct __sk_buff, tstamp))) return -EINVAL; /* tstamp is allowed */ /* wire_len is allowed */ /* gso_segs is allowed */ if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs), offsetof(struct __sk_buff, gso_size))) return -EINVAL; /* gso_size is allowed */ if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_size), offsetof(struct __sk_buff, hwtstamp))) return -EINVAL; /* hwtstamp is allowed */ if (!range_is_zero(__skb, offsetofend(struct __sk_buff, hwtstamp), sizeof(struct __sk_buff))) return -EINVAL; skb->mark = __skb->mark; skb->priority = __skb->priority; skb->skb_iif = __skb->ingress_ifindex; skb->tstamp = __skb->tstamp; memcpy(&cb->data, __skb->cb, QDISC_CB_PRIV_LEN); if (__skb->wire_len == 0) { cb->pkt_len = skb->len; } else { if (__skb->wire_len < skb->len || __skb->wire_len > GSO_LEGACY_MAX_SIZE) return -EINVAL; cb->pkt_len = __skb->wire_len; } if (__skb->gso_segs > GSO_MAX_SEGS) return -EINVAL; skb_shinfo(skb)->gso_segs = __skb->gso_segs; skb_shinfo(skb)->gso_size = __skb->gso_size; skb_shinfo(skb)->hwtstamps.hwtstamp = __skb->hwtstamp; return 0; } static void convert_skb_to___skb(struct sk_buff *skb, struct __sk_buff *__skb) { struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb; if (!__skb) return; __skb->mark = skb->mark; __skb->priority = skb->priority; __skb->ingress_ifindex = skb->skb_iif; __skb->ifindex = skb->dev->ifindex; __skb->tstamp = skb->tstamp; memcpy(__skb->cb, &cb->data, QDISC_CB_PRIV_LEN); __skb->wire_len = cb->pkt_len; __skb->gso_segs = skb_shinfo(skb)->gso_segs; __skb->hwtstamp = skb_shinfo(skb)->hwtstamps.hwtstamp; } static struct proto bpf_dummy_proto = { .name = "bpf_dummy", .owner = THIS_MODULE, .obj_size = sizeof(struct sock), }; int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { bool is_l2 = false, is_direct_pkt_access = false; struct net *net = current->nsproxy->net_ns; struct net_device *dev = net->loopback_dev; u32 size = kattr->test.data_size_in; u32 repeat = kattr->test.repeat; struct __sk_buff *ctx = NULL; u32 retval, duration; int hh_len = ETH_HLEN; struct sk_buff *skb; struct sock *sk; void *data; int ret; if ((kattr->test.flags & ~BPF_F_TEST_SKB_CHECKSUM_COMPLETE) || kattr->test.cpu || kattr->test.batch_size) return -EINVAL; data = bpf_test_init(kattr, kattr->test.data_size_in, size, NET_SKB_PAD + NET_IP_ALIGN, SKB_DATA_ALIGN(sizeof(struct skb_shared_info))); if (IS_ERR(data)) return PTR_ERR(data); ctx = bpf_ctx_init(kattr, sizeof(struct __sk_buff)); if (IS_ERR(ctx)) { kfree(data); return PTR_ERR(ctx); } switch (prog->type) { case BPF_PROG_TYPE_SCHED_CLS: case BPF_PROG_TYPE_SCHED_ACT: is_l2 = true; fallthrough; case BPF_PROG_TYPE_LWT_IN: case BPF_PROG_TYPE_LWT_OUT: case BPF_PROG_TYPE_LWT_XMIT: case BPF_PROG_TYPE_CGROUP_SKB: is_direct_pkt_access = true; break; default: break; } sk = sk_alloc(net, AF_UNSPEC, GFP_USER, &bpf_dummy_proto, 1); if (!sk) { kfree(data); kfree(ctx); return -ENOMEM; } sock_init_data(NULL, sk); skb = slab_build_skb(data); if (!skb) { kfree(data); kfree(ctx); sk_free(sk); return -ENOMEM; } skb->sk = sk; skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); __skb_put(skb, size); if (ctx && ctx->ifindex > 1) { dev = dev_get_by_index(net, ctx->ifindex); if (!dev) { ret = -ENODEV; goto out; } } skb->protocol = eth_type_trans(skb, dev); skb_reset_network_header(skb); switch (skb->protocol) { case htons(ETH_P_IP): sk->sk_family = AF_INET; if (sizeof(struct iphdr) <= skb_headlen(skb)) { sk->sk_rcv_saddr = ip_hdr(skb)->saddr; sk->sk_daddr = ip_hdr(skb)->daddr; } break; #if IS_ENABLED(CONFIG_IPV6) case htons(ETH_P_IPV6): sk->sk_family = AF_INET6; if (sizeof(struct ipv6hdr) <= skb_headlen(skb)) { sk->sk_v6_rcv_saddr = ipv6_hdr(skb)->saddr; sk->sk_v6_daddr = ipv6_hdr(skb)->daddr; } break; #endif default: break; } if (is_l2) __skb_push(skb, hh_len); if (is_direct_pkt_access) bpf_compute_data_pointers(skb); ret = convert___skb_to_skb(skb, ctx); if (ret) goto out; if (kattr->test.flags & BPF_F_TEST_SKB_CHECKSUM_COMPLETE) { const int off = skb_network_offset(skb); int len = skb->len - off; skb->csum = skb_checksum(skb, off, len, 0); skb->ip_summed = CHECKSUM_COMPLETE; } ret = bpf_test_run(prog, skb, repeat, &retval, &duration, false); if (ret) goto out; if (!is_l2) { if (skb_headroom(skb) < hh_len) { int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb)); if (pskb_expand_head(skb, nhead, 0, GFP_USER)) { ret = -ENOMEM; goto out; } } memset(__skb_push(skb, hh_len), 0, hh_len); } if (kattr->test.flags & BPF_F_TEST_SKB_CHECKSUM_COMPLETE) { const int off = skb_network_offset(skb); int len = skb->len - off; __wsum csum; csum = skb_checksum(skb, off, len, 0); if (csum_fold(skb->csum) != csum_fold(csum)) { ret = -EBADMSG; goto out; } } convert_skb_to___skb(skb, ctx); size = skb->len; /* bpf program can never convert linear skb to non-linear */ if (WARN_ON_ONCE(skb_is_nonlinear(skb))) size = skb_headlen(skb); ret = bpf_test_finish(kattr, uattr, skb->data, NULL, size, retval, duration); if (!ret) ret = bpf_ctx_finish(kattr, uattr, ctx, sizeof(struct __sk_buff)); out: if (dev && dev != net->loopback_dev) dev_put(dev); kfree_skb(skb); sk_free(sk); kfree(ctx); return ret; } static int xdp_convert_md_to_buff(struct xdp_md *xdp_md, struct xdp_buff *xdp) { unsigned int ingress_ifindex, rx_queue_index; struct netdev_rx_queue *rxqueue; struct net_device *device; if (!xdp_md) return 0; if (xdp_md->egress_ifindex != 0) return -EINVAL; ingress_ifindex = xdp_md->ingress_ifindex; rx_queue_index = xdp_md->rx_queue_index; if (!ingress_ifindex && rx_queue_index) return -EINVAL; if (ingress_ifindex) { device = dev_get_by_index(current->nsproxy->net_ns, ingress_ifindex); if (!device) return -ENODEV; if (rx_queue_index >= device->real_num_rx_queues) goto free_dev; rxqueue = __netif_get_rx_queue(device, rx_queue_index); if (!xdp_rxq_info_is_reg(&rxqueue->xdp_rxq)) goto free_dev; xdp->rxq = &rxqueue->xdp_rxq; /* The device is now tracked in the xdp->rxq for later * dev_put() */ } xdp->data = xdp->data_meta + xdp_md->data; return 0; free_dev: dev_put(device); return -EINVAL; } static void xdp_convert_buff_to_md(struct xdp_buff *xdp, struct xdp_md *xdp_md) { if (!xdp_md) return; xdp_md->data = xdp->data - xdp->data_meta; xdp_md->data_end = xdp->data_end - xdp->data_meta; if (xdp_md->ingress_ifindex) dev_put(xdp->rxq->dev); } int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { bool do_live = (kattr->test.flags & BPF_F_TEST_XDP_LIVE_FRAMES); u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); u32 batch_size = kattr->test.batch_size; u32 retval = 0, duration, max_data_sz; u32 size = kattr->test.data_size_in; u32 headroom = XDP_PACKET_HEADROOM; u32 repeat = kattr->test.repeat; struct netdev_rx_queue *rxqueue; struct skb_shared_info *sinfo; struct xdp_buff xdp = {}; int i, ret = -EINVAL; struct xdp_md *ctx; void *data; if (prog->expected_attach_type == BPF_XDP_DEVMAP || prog->expected_attach_type == BPF_XDP_CPUMAP) return -EINVAL; if (kattr->test.flags & ~BPF_F_TEST_XDP_LIVE_FRAMES) return -EINVAL; if (bpf_prog_is_dev_bound(prog->aux)) return -EINVAL; if (do_live) { if (!batch_size) batch_size = NAPI_POLL_WEIGHT; else if (batch_size > TEST_XDP_MAX_BATCH) return -E2BIG; headroom += sizeof(struct xdp_page_head); } else if (batch_size) { return -EINVAL; } ctx = bpf_ctx_init(kattr, sizeof(struct xdp_md)); if (IS_ERR(ctx)) return PTR_ERR(ctx); if (ctx) { /* There can't be user provided data before the meta data */ if (ctx->data_meta || ctx->data_end != size || ctx->data > ctx->data_end || unlikely(xdp_metalen_invalid(ctx->data)) || (do_live && (kattr->test.data_out || kattr->test.ctx_out))) goto free_ctx; /* Meta data is allocated from the headroom */ headroom -= ctx->data; } max_data_sz = PAGE_SIZE - headroom - tailroom; if (size > max_data_sz) { /* disallow live data mode for jumbo frames */ if (do_live) goto free_ctx; size = max_data_sz; } data = bpf_test_init(kattr, size, max_data_sz, headroom, tailroom); if (IS_ERR(data)) { ret = PTR_ERR(data); goto free_ctx; } rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, 0); rxqueue->xdp_rxq.frag_size = headroom + max_data_sz + tailroom; xdp_init_buff(&xdp, rxqueue->xdp_rxq.frag_size, &rxqueue->xdp_rxq); xdp_prepare_buff(&xdp, data, headroom, size, true); sinfo = xdp_get_shared_info_from_buff(&xdp); ret = xdp_convert_md_to_buff(ctx, &xdp); if (ret) goto free_data; if (unlikely(kattr->test.data_size_in > size)) { void __user *data_in = u64_to_user_ptr(kattr->test.data_in); while (size < kattr->test.data_size_in) { struct page *page; skb_frag_t *frag; u32 data_len; if (sinfo->nr_frags == MAX_SKB_FRAGS) { ret = -ENOMEM; goto out; } page = alloc_page(GFP_KERNEL); if (!page) { ret = -ENOMEM; goto out; } frag = &sinfo->frags[sinfo->nr_frags++]; data_len = min_t(u32, kattr->test.data_size_in - size, PAGE_SIZE); skb_frag_fill_page_desc(frag, page, 0, data_len); if (copy_from_user(page_address(page), data_in + size, data_len)) { ret = -EFAULT; goto out; } sinfo->xdp_frags_size += data_len; size += data_len; } xdp_buff_set_frags_flag(&xdp); } if (repeat > 1) bpf_prog_change_xdp(NULL, prog); if (do_live) ret = bpf_test_run_xdp_live(prog, &xdp, repeat, batch_size, &duration); else ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true); /* We convert the xdp_buff back to an xdp_md before checking the return * code so the reference count of any held netdevice will be decremented * even if the test run failed. */ xdp_convert_buff_to_md(&xdp, ctx); if (ret) goto out; size = xdp.data_end - xdp.data_meta + sinfo->xdp_frags_size; ret = bpf_test_finish(kattr, uattr, xdp.data_meta, sinfo, size, retval, duration); if (!ret) ret = bpf_ctx_finish(kattr, uattr, ctx, sizeof(struct xdp_md)); out: if (repeat > 1) bpf_prog_change_xdp(prog, NULL); free_data: for (i = 0; i < sinfo->nr_frags; i++) __free_page(skb_frag_page(&sinfo->frags[i])); kfree(data); free_ctx: kfree(ctx); return ret; } static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx) { /* make sure the fields we don't use are zeroed */ if (!range_is_zero(ctx, 0, offsetof(struct bpf_flow_keys, flags))) return -EINVAL; /* flags is allowed */ if (!range_is_zero(ctx, offsetofend(struct bpf_flow_keys, flags), sizeof(struct bpf_flow_keys))) return -EINVAL; return 0; } int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { struct bpf_test_timer t = { NO_PREEMPT }; u32 size = kattr->test.data_size_in; struct bpf_flow_dissector ctx = {}; u32 repeat = kattr->test.repeat; struct bpf_flow_keys *user_ctx; struct bpf_flow_keys flow_keys; const struct ethhdr *eth; unsigned int flags = 0; u32 retval, duration; void *data; int ret; if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size) return -EINVAL; if (size < ETH_HLEN) return -EINVAL; data = bpf_test_init(kattr, kattr->test.data_size_in, size, 0, 0); if (IS_ERR(data)) return PTR_ERR(data); eth = (struct ethhdr *)data; if (!repeat) repeat = 1; user_ctx = bpf_ctx_init(kattr, sizeof(struct bpf_flow_keys)); if (IS_ERR(user_ctx)) { kfree(data); return PTR_ERR(user_ctx); } if (user_ctx) { ret = verify_user_bpf_flow_keys(user_ctx); if (ret) goto out; flags = user_ctx->flags; } ctx.flow_keys = &flow_keys; ctx.data = data; ctx.data_end = (__u8 *)data + size; bpf_test_timer_enter(&t); do { retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN, size, flags); } while (bpf_test_timer_continue(&t, 1, repeat, &ret, &duration)); bpf_test_timer_leave(&t); if (ret < 0) goto out; ret = bpf_test_finish(kattr, uattr, &flow_keys, NULL, sizeof(flow_keys), retval, duration); if (!ret) ret = bpf_ctx_finish(kattr, uattr, user_ctx, sizeof(struct bpf_flow_keys)); out: kfree(user_ctx); kfree(data); return ret; } int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { struct bpf_test_timer t = { NO_PREEMPT }; struct bpf_prog_array *progs = NULL; struct bpf_sk_lookup_kern ctx = {}; u32 repeat = kattr->test.repeat; struct bpf_sk_lookup *user_ctx; u32 retval, duration; int ret = -EINVAL; if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size) return -EINVAL; if (kattr->test.data_in || kattr->test.data_size_in || kattr->test.data_out || kattr->test.data_size_out) return -EINVAL; if (!repeat) repeat = 1; user_ctx = bpf_ctx_init(kattr, sizeof(*user_ctx)); if (IS_ERR(user_ctx)) return PTR_ERR(user_ctx); if (!user_ctx) return -EINVAL; if (user_ctx->sk) goto out; if (!range_is_zero(user_ctx, offsetofend(typeof(*user_ctx), local_port), sizeof(*user_ctx))) goto out; if (user_ctx->local_port > U16_MAX) { ret = -ERANGE; goto out; } ctx.family = (u16)user_ctx->family; ctx.protocol = (u16)user_ctx->protocol; ctx.dport = (u16)user_ctx->local_port; ctx.sport = user_ctx->remote_port; switch (ctx.family) { case AF_INET: ctx.v4.daddr = (__force __be32)user_ctx->local_ip4; ctx.v4.saddr = (__force __be32)user_ctx->remote_ip4; break; #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: ctx.v6.daddr = (struct in6_addr *)user_ctx->local_ip6; ctx.v6.saddr = (struct in6_addr *)user_ctx->remote_ip6; break; #endif default: ret = -EAFNOSUPPORT; goto out; } progs = bpf_prog_array_alloc(1, GFP_KERNEL); if (!progs) { ret = -ENOMEM; goto out; } progs->items[0].prog = prog; bpf_test_timer_enter(&t); do { ctx.selected_sk = NULL; retval = BPF_PROG_SK_LOOKUP_RUN_ARRAY(progs, ctx, bpf_prog_run); } while (bpf_test_timer_continue(&t, 1, repeat, &ret, &duration)); bpf_test_timer_leave(&t); if (ret < 0) goto out; user_ctx->cookie = 0; if (ctx.selected_sk) { if (ctx.selected_sk->sk_reuseport && !ctx.no_reuseport) { ret = -EOPNOTSUPP; goto out; } user_ctx->cookie = sock_gen_cookie(ctx.selected_sk); } ret = bpf_test_finish(kattr, uattr, NULL, NULL, 0, retval, duration); if (!ret) ret = bpf_ctx_finish(kattr, uattr, user_ctx, sizeof(*user_ctx)); out: bpf_prog_array_free(progs); kfree(user_ctx); return ret; } int bpf_prog_test_run_syscall(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { void __user *ctx_in = u64_to_user_ptr(kattr->test.ctx_in); __u32 ctx_size_in = kattr->test.ctx_size_in; void *ctx = NULL; u32 retval; int err = 0; /* doesn't support data_in/out, ctx_out, duration, or repeat or flags */ if (kattr->test.data_in || kattr->test.data_out || kattr->test.ctx_out || kattr->test.duration || kattr->test.repeat || kattr->test.flags || kattr->test.batch_size) return -EINVAL; if (ctx_size_in < prog->aux->max_ctx_offset || ctx_size_in > U16_MAX) return -EINVAL; if (ctx_size_in) { ctx = memdup_user(ctx_in, ctx_size_in); if (IS_ERR(ctx)) return PTR_ERR(ctx); } rcu_read_lock_trace(); retval = bpf_prog_run_pin_on_cpu(prog, ctx); rcu_read_unlock_trace(); if (copy_to_user(&uattr->test.retval, &retval, sizeof(u32))) { err = -EFAULT; goto out; } if (ctx_size_in) if (copy_to_user(ctx_in, ctx, ctx_size_in)) err = -EFAULT; out: kfree(ctx); return err; } static int verify_and_copy_hook_state(struct nf_hook_state *state, const struct nf_hook_state *user, struct net_device *dev) { if (user->in || user->out) return -EINVAL; if (user->net || user->sk || user->okfn) return -EINVAL; switch (user->pf) { case NFPROTO_IPV4: case NFPROTO_IPV6: switch (state->hook) { case NF_INET_PRE_ROUTING: state->in = dev; break; case NF_INET_LOCAL_IN: state->in = dev; break; case NF_INET_FORWARD: state->in = dev; state->out = dev; break; case NF_INET_LOCAL_OUT: state->out = dev; break; case NF_INET_POST_ROUTING: state->out = dev; break; } break; default: return -EINVAL; } state->pf = user->pf; state->hook = user->hook; return 0; } static __be16 nfproto_eth(int nfproto) { switch (nfproto) { case NFPROTO_IPV4: return htons(ETH_P_IP); case NFPROTO_IPV6: break; } return htons(ETH_P_IPV6); } int bpf_prog_test_run_nf(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { struct net *net = current->nsproxy->net_ns; struct net_device *dev = net->loopback_dev; struct nf_hook_state *user_ctx, hook_state = { .pf = NFPROTO_IPV4, .hook = NF_INET_LOCAL_OUT, }; u32 size = kattr->test.data_size_in; u32 repeat = kattr->test.repeat; struct bpf_nf_ctx ctx = { .state = &hook_state, }; struct sk_buff *skb = NULL; u32 retval, duration; void *data; int ret; if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size) return -EINVAL; if (size < sizeof(struct iphdr)) return -EINVAL; data = bpf_test_init(kattr, kattr->test.data_size_in, size, NET_SKB_PAD + NET_IP_ALIGN, SKB_DATA_ALIGN(sizeof(struct skb_shared_info))); if (IS_ERR(data)) return PTR_ERR(data); if (!repeat) repeat = 1; user_ctx = bpf_ctx_init(kattr, sizeof(struct nf_hook_state)); if (IS_ERR(user_ctx)) { kfree(data); return PTR_ERR(user_ctx); } if (user_ctx) { ret = verify_and_copy_hook_state(&hook_state, user_ctx, dev); if (ret) goto out; } skb = slab_build_skb(data); if (!skb) { ret = -ENOMEM; goto out; } data = NULL; /* data released via kfree_skb */ skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); __skb_put(skb, size); ret = -EINVAL; if (hook_state.hook != NF_INET_LOCAL_OUT) { if (size < ETH_HLEN + sizeof(struct iphdr)) goto out; skb->protocol = eth_type_trans(skb, dev); switch (skb->protocol) { case htons(ETH_P_IP): if (hook_state.pf == NFPROTO_IPV4) break; goto out; case htons(ETH_P_IPV6): if (size < ETH_HLEN + sizeof(struct ipv6hdr)) goto out; if (hook_state.pf == NFPROTO_IPV6) break; goto out; default: ret = -EPROTO; goto out; } skb_reset_network_header(skb); } else { skb->protocol = nfproto_eth(hook_state.pf); } ctx.skb = skb; ret = bpf_test_run(prog, &ctx, repeat, &retval, &duration, false); if (ret) goto out; ret = bpf_test_finish(kattr, uattr, NULL, NULL, 0, retval, duration); out: kfree(user_ctx); kfree_skb(skb); kfree(data); return ret; } static const struct btf_kfunc_id_set bpf_prog_test_kfunc_set = { .owner = THIS_MODULE, .set = &test_sk_check_kfunc_ids, }; BTF_ID_LIST(bpf_prog_test_dtor_kfunc_ids) BTF_ID(struct, prog_test_ref_kfunc) BTF_ID(func, bpf_kfunc_call_test_release_dtor) BTF_ID(struct, prog_test_member) BTF_ID(func, bpf_kfunc_call_memb_release_dtor) static int __init bpf_prog_test_run_init(void) { const struct btf_id_dtor_kfunc bpf_prog_test_dtor_kfunc[] = { { .btf_id = bpf_prog_test_dtor_kfunc_ids[0], .kfunc_btf_id = bpf_prog_test_dtor_kfunc_ids[1] }, { .btf_id = bpf_prog_test_dtor_kfunc_ids[2], .kfunc_btf_id = bpf_prog_test_dtor_kfunc_ids[3], }, }; int ret; ret = register_btf_fmodret_id_set(&bpf_test_modify_return_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_prog_test_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_prog_test_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_prog_test_kfunc_set); return ret ?: register_btf_id_dtor_kfuncs(bpf_prog_test_dtor_kfunc, ARRAY_SIZE(bpf_prog_test_dtor_kfunc), THIS_MODULE); } late_initcall(bpf_prog_test_run_init); |
| 2 2 2 5 5 2 2 2 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 | // SPDX-License-Identifier: GPL-2.0-only #include "netlink.h" #include "common.h" #include "bitset.h" struct privflags_req_info { struct ethnl_req_info base; }; struct privflags_reply_data { struct ethnl_reply_data base; const char (*priv_flag_names)[ETH_GSTRING_LEN]; unsigned int n_priv_flags; u32 priv_flags; }; #define PRIVFLAGS_REPDATA(__reply_base) \ container_of(__reply_base, struct privflags_reply_data, base) const struct nla_policy ethnl_privflags_get_policy[] = { [ETHTOOL_A_PRIVFLAGS_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), }; static int ethnl_get_priv_flags_info(struct net_device *dev, unsigned int *count, const char (**names)[ETH_GSTRING_LEN]) { const struct ethtool_ops *ops = dev->ethtool_ops; int nflags; nflags = ops->get_sset_count(dev, ETH_SS_PRIV_FLAGS); if (nflags < 0) return nflags; if (names) { *names = kcalloc(nflags, ETH_GSTRING_LEN, GFP_KERNEL); if (!*names) return -ENOMEM; ops->get_strings(dev, ETH_SS_PRIV_FLAGS, (u8 *)*names); } /* We can pass more than 32 private flags to userspace via netlink but * we cannot get more with ethtool_ops::get_priv_flags(). Note that we * must not adjust nflags before allocating the space for flag names * as the buffer must be large enough for all flags. */ if (WARN_ONCE(nflags > 32, "device %s reports more than 32 private flags (%d)\n", netdev_name(dev), nflags)) nflags = 32; *count = nflags; return 0; } static int privflags_prepare_data(const struct ethnl_req_info *req_base, struct ethnl_reply_data *reply_base, const struct genl_info *info) { struct privflags_reply_data *data = PRIVFLAGS_REPDATA(reply_base); struct net_device *dev = reply_base->dev; const char (*names)[ETH_GSTRING_LEN]; const struct ethtool_ops *ops; unsigned int nflags; int ret; ops = dev->ethtool_ops; if (!ops->get_priv_flags || !ops->get_sset_count || !ops->get_strings) return -EOPNOTSUPP; ret = ethnl_ops_begin(dev); if (ret < 0) return ret; ret = ethnl_get_priv_flags_info(dev, &nflags, &names); if (ret < 0) goto out_ops; data->priv_flags = ops->get_priv_flags(dev); data->priv_flag_names = names; data->n_priv_flags = nflags; out_ops: ethnl_ops_complete(dev); return ret; } static int privflags_reply_size(const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { const struct privflags_reply_data *data = PRIVFLAGS_REPDATA(reply_base); bool compact = req_base->flags & ETHTOOL_FLAG_COMPACT_BITSETS; const u32 all_flags = ~(u32)0 >> (32 - data->n_priv_flags); return ethnl_bitset32_size(&data->priv_flags, &all_flags, data->n_priv_flags, data->priv_flag_names, compact); } static int privflags_fill_reply(struct sk_buff *skb, const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { const struct privflags_reply_data *data = PRIVFLAGS_REPDATA(reply_base); bool compact = req_base->flags & ETHTOOL_FLAG_COMPACT_BITSETS; const u32 all_flags = ~(u32)0 >> (32 - data->n_priv_flags); return ethnl_put_bitset32(skb, ETHTOOL_A_PRIVFLAGS_FLAGS, &data->priv_flags, &all_flags, data->n_priv_flags, data->priv_flag_names, compact); } static void privflags_cleanup_data(struct ethnl_reply_data *reply_data) { struct privflags_reply_data *data = PRIVFLAGS_REPDATA(reply_data); kfree(data->priv_flag_names); } /* PRIVFLAGS_SET */ const struct nla_policy ethnl_privflags_set_policy[] = { [ETHTOOL_A_PRIVFLAGS_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), [ETHTOOL_A_PRIVFLAGS_FLAGS] = { .type = NLA_NESTED }, }; static int ethnl_set_privflags_validate(struct ethnl_req_info *req_info, struct genl_info *info) { const struct ethtool_ops *ops = req_info->dev->ethtool_ops; if (!info->attrs[ETHTOOL_A_PRIVFLAGS_FLAGS]) return -EINVAL; if (!ops->get_priv_flags || !ops->set_priv_flags || !ops->get_sset_count || !ops->get_strings) return -EOPNOTSUPP; return 1; } static int ethnl_set_privflags(struct ethnl_req_info *req_info, struct genl_info *info) { const char (*names)[ETH_GSTRING_LEN] = NULL; struct net_device *dev = req_info->dev; struct nlattr **tb = info->attrs; unsigned int nflags; bool mod = false; bool compact; u32 flags; int ret; ret = ethnl_bitset_is_compact(tb[ETHTOOL_A_PRIVFLAGS_FLAGS], &compact); if (ret < 0) return ret; ret = ethnl_get_priv_flags_info(dev, &nflags, compact ? NULL : &names); if (ret < 0) return ret; flags = dev->ethtool_ops->get_priv_flags(dev); ret = ethnl_update_bitset32(&flags, nflags, tb[ETHTOOL_A_PRIVFLAGS_FLAGS], names, info->extack, &mod); if (ret < 0 || !mod) goto out_free; ret = dev->ethtool_ops->set_priv_flags(dev, flags); if (ret < 0) goto out_free; ret = 1; out_free: kfree(names); return ret; } const struct ethnl_request_ops ethnl_privflags_request_ops = { .request_cmd = ETHTOOL_MSG_PRIVFLAGS_GET, .reply_cmd = ETHTOOL_MSG_PRIVFLAGS_GET_REPLY, .hdr_attr = ETHTOOL_A_PRIVFLAGS_HEADER, .req_info_size = sizeof(struct privflags_req_info), .reply_data_size = sizeof(struct privflags_reply_data), .prepare_data = privflags_prepare_data, .reply_size = privflags_reply_size, .fill_reply = privflags_fill_reply, .cleanup_data = privflags_cleanup_data, .set_validate = ethnl_set_privflags_validate, .set = ethnl_set_privflags, .set_ntf_cmd = ETHTOOL_MSG_PRIVFLAGS_NTF, }; |
| 1 3 4 4 2 1 2 1 55 33 1 11 10 12 5 1 1 1 1 1 3 2 4 2 12 3 1 2 6 1 1 74 3 2 56 13 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (c) 2008-2009 Patrick McHardy <kaber@trash.net> * * Development of this code funded by Astaro AG (http://www.astaro.com/) */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nf_tables.h> #include <net/netfilter/nf_tables_core.h> #include <net/netfilter/nf_tables.h> #include <net/netfilter/nf_tables_offload.h> struct nft_bitwise { u8 sreg; u8 sreg2; u8 dreg; enum nft_bitwise_ops op:8; u8 len; struct nft_data mask; struct nft_data xor; struct nft_data data; }; static void nft_bitwise_eval_mask_xor(u32 *dst, const u32 *src, const struct nft_bitwise *priv) { unsigned int i; for (i = 0; i < DIV_ROUND_UP(priv->len, sizeof(u32)); i++) dst[i] = (src[i] & priv->mask.data[i]) ^ priv->xor.data[i]; } static void nft_bitwise_eval_lshift(u32 *dst, const u32 *src, const struct nft_bitwise *priv) { u32 shift = priv->data.data[0]; unsigned int i; u32 carry = 0; for (i = DIV_ROUND_UP(priv->len, sizeof(u32)); i > 0; i--) { dst[i - 1] = (src[i - 1] << shift) | carry; carry = src[i - 1] >> (BITS_PER_TYPE(u32) - shift); } } static void nft_bitwise_eval_rshift(u32 *dst, const u32 *src, const struct nft_bitwise *priv) { u32 shift = priv->data.data[0]; unsigned int i; u32 carry = 0; for (i = 0; i < DIV_ROUND_UP(priv->len, sizeof(u32)); i++) { dst[i] = carry | (src[i] >> shift); carry = src[i] << (BITS_PER_TYPE(u32) - shift); } } static void nft_bitwise_eval_and(u32 *dst, const u32 *src, const u32 *src2, const struct nft_bitwise *priv) { unsigned int i, n; for (i = 0, n = DIV_ROUND_UP(priv->len, sizeof(u32)); i < n; i++) dst[i] = src[i] & src2[i]; } static void nft_bitwise_eval_or(u32 *dst, const u32 *src, const u32 *src2, const struct nft_bitwise *priv) { unsigned int i, n; for (i = 0, n = DIV_ROUND_UP(priv->len, sizeof(u32)); i < n; i++) dst[i] = src[i] | src2[i]; } static void nft_bitwise_eval_xor(u32 *dst, const u32 *src, const u32 *src2, const struct nft_bitwise *priv) { unsigned int i, n; for (i = 0, n = DIV_ROUND_UP(priv->len, sizeof(u32)); i < n; i++) dst[i] = src[i] ^ src2[i]; } void nft_bitwise_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) { const struct nft_bitwise *priv = nft_expr_priv(expr); const u32 *src = ®s->data[priv->sreg], *src2; u32 *dst = ®s->data[priv->dreg]; if (priv->op == NFT_BITWISE_MASK_XOR) { nft_bitwise_eval_mask_xor(dst, src, priv); return; } if (priv->op == NFT_BITWISE_LSHIFT) { nft_bitwise_eval_lshift(dst, src, priv); return; } if (priv->op == NFT_BITWISE_RSHIFT) { nft_bitwise_eval_rshift(dst, src, priv); return; } src2 = priv->sreg2 ? ®s->data[priv->sreg2] : priv->data.data; if (priv->op == NFT_BITWISE_AND) { nft_bitwise_eval_and(dst, src, src2, priv); return; } if (priv->op == NFT_BITWISE_OR) { nft_bitwise_eval_or(dst, src, src2, priv); return; } if (priv->op == NFT_BITWISE_XOR) { nft_bitwise_eval_xor(dst, src, src2, priv); return; } } static const struct nla_policy nft_bitwise_policy[NFTA_BITWISE_MAX + 1] = { [NFTA_BITWISE_SREG] = { .type = NLA_U32 }, [NFTA_BITWISE_SREG2] = { .type = NLA_U32 }, [NFTA_BITWISE_DREG] = { .type = NLA_U32 }, [NFTA_BITWISE_LEN] = { .type = NLA_U32 }, [NFTA_BITWISE_MASK] = { .type = NLA_NESTED }, [NFTA_BITWISE_XOR] = { .type = NLA_NESTED }, [NFTA_BITWISE_OP] = NLA_POLICY_MAX(NLA_BE32, 255), [NFTA_BITWISE_DATA] = { .type = NLA_NESTED }, }; static int nft_bitwise_init_mask_xor(struct nft_bitwise *priv, const struct nlattr *const tb[]) { struct nft_data_desc mask = { .type = NFT_DATA_VALUE, .size = sizeof(priv->mask), .len = priv->len, }; struct nft_data_desc xor = { .type = NFT_DATA_VALUE, .size = sizeof(priv->xor), .len = priv->len, }; int err; if (tb[NFTA_BITWISE_DATA] || tb[NFTA_BITWISE_SREG2]) return -EINVAL; if (!tb[NFTA_BITWISE_MASK] || !tb[NFTA_BITWISE_XOR]) return -EINVAL; err = nft_data_init(NULL, &priv->mask, &mask, tb[NFTA_BITWISE_MASK]); if (err < 0) return err; err = nft_data_init(NULL, &priv->xor, &xor, tb[NFTA_BITWISE_XOR]); if (err < 0) goto err_xor_err; return 0; err_xor_err: nft_data_release(&priv->mask, mask.type); return err; } static int nft_bitwise_init_shift(struct nft_bitwise *priv, const struct nlattr *const tb[]) { struct nft_data_desc desc = { .type = NFT_DATA_VALUE, .size = sizeof(priv->data), .len = sizeof(u32), }; int err; if (tb[NFTA_BITWISE_MASK] || tb[NFTA_BITWISE_XOR] || tb[NFTA_BITWISE_SREG2]) return -EINVAL; if (!tb[NFTA_BITWISE_DATA]) return -EINVAL; err = nft_data_init(NULL, &priv->data, &desc, tb[NFTA_BITWISE_DATA]); if (err < 0) return err; if (priv->data.data[0] >= BITS_PER_TYPE(u32)) { nft_data_release(&priv->data, desc.type); return -EINVAL; } return 0; } static int nft_bitwise_init_bool(const struct nft_ctx *ctx, struct nft_bitwise *priv, const struct nlattr *const tb[]) { int err; if (tb[NFTA_BITWISE_MASK] || tb[NFTA_BITWISE_XOR]) return -EINVAL; if ((!tb[NFTA_BITWISE_DATA] && !tb[NFTA_BITWISE_SREG2]) || (tb[NFTA_BITWISE_DATA] && tb[NFTA_BITWISE_SREG2])) return -EINVAL; if (tb[NFTA_BITWISE_DATA]) { struct nft_data_desc desc = { .type = NFT_DATA_VALUE, .size = sizeof(priv->data), .len = priv->len, }; err = nft_data_init(NULL, &priv->data, &desc, tb[NFTA_BITWISE_DATA]); if (err < 0) return err; } else { err = nft_parse_register_load(ctx, tb[NFTA_BITWISE_SREG2], &priv->sreg2, priv->len); if (err < 0) return err; } return 0; } static int nft_bitwise_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { struct nft_bitwise *priv = nft_expr_priv(expr); u32 len; int err; err = nft_parse_u32_check(tb[NFTA_BITWISE_LEN], U8_MAX, &len); if (err < 0) return err; priv->len = len; err = nft_parse_register_load(ctx, tb[NFTA_BITWISE_SREG], &priv->sreg, priv->len); if (err < 0) return err; err = nft_parse_register_store(ctx, tb[NFTA_BITWISE_DREG], &priv->dreg, NULL, NFT_DATA_VALUE, priv->len); if (err < 0) return err; if (tb[NFTA_BITWISE_OP]) { priv->op = ntohl(nla_get_be32(tb[NFTA_BITWISE_OP])); switch (priv->op) { case NFT_BITWISE_MASK_XOR: case NFT_BITWISE_LSHIFT: case NFT_BITWISE_RSHIFT: case NFT_BITWISE_AND: case NFT_BITWISE_OR: case NFT_BITWISE_XOR: break; default: return -EOPNOTSUPP; } } else { priv->op = NFT_BITWISE_MASK_XOR; } switch(priv->op) { case NFT_BITWISE_MASK_XOR: err = nft_bitwise_init_mask_xor(priv, tb); break; case NFT_BITWISE_LSHIFT: case NFT_BITWISE_RSHIFT: err = nft_bitwise_init_shift(priv, tb); break; case NFT_BITWISE_AND: case NFT_BITWISE_OR: case NFT_BITWISE_XOR: err = nft_bitwise_init_bool(ctx, priv, tb); break; } return err; } static int nft_bitwise_dump_mask_xor(struct sk_buff *skb, const struct nft_bitwise *priv) { if (nft_data_dump(skb, NFTA_BITWISE_MASK, &priv->mask, NFT_DATA_VALUE, priv->len) < 0) return -1; if (nft_data_dump(skb, NFTA_BITWISE_XOR, &priv->xor, NFT_DATA_VALUE, priv->len) < 0) return -1; return 0; } static int nft_bitwise_dump_shift(struct sk_buff *skb, const struct nft_bitwise *priv) { if (nft_data_dump(skb, NFTA_BITWISE_DATA, &priv->data, NFT_DATA_VALUE, sizeof(u32)) < 0) return -1; return 0; } static int nft_bitwise_dump_bool(struct sk_buff *skb, const struct nft_bitwise *priv) { if (priv->sreg2) { if (nft_dump_register(skb, NFTA_BITWISE_SREG2, priv->sreg2)) return -1; } else { if (nft_data_dump(skb, NFTA_BITWISE_DATA, &priv->data, NFT_DATA_VALUE, sizeof(u32)) < 0) return -1; } return 0; } static int nft_bitwise_dump(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { const struct nft_bitwise *priv = nft_expr_priv(expr); int err = 0; if (nft_dump_register(skb, NFTA_BITWISE_SREG, priv->sreg)) return -1; if (nft_dump_register(skb, NFTA_BITWISE_DREG, priv->dreg)) return -1; if (nla_put_be32(skb, NFTA_BITWISE_LEN, htonl(priv->len))) return -1; if (nla_put_be32(skb, NFTA_BITWISE_OP, htonl(priv->op))) return -1; switch (priv->op) { case NFT_BITWISE_MASK_XOR: err = nft_bitwise_dump_mask_xor(skb, priv); break; case NFT_BITWISE_LSHIFT: case NFT_BITWISE_RSHIFT: err = nft_bitwise_dump_shift(skb, priv); break; case NFT_BITWISE_AND: case NFT_BITWISE_OR: case NFT_BITWISE_XOR: err = nft_bitwise_dump_bool(skb, priv); break; } return err; } static struct nft_data zero; static int nft_bitwise_offload(struct nft_offload_ctx *ctx, struct nft_flow_rule *flow, const struct nft_expr *expr) { const struct nft_bitwise *priv = nft_expr_priv(expr); struct nft_offload_reg *reg = &ctx->regs[priv->dreg]; if (priv->op != NFT_BITWISE_MASK_XOR) return -EOPNOTSUPP; if (memcmp(&priv->xor, &zero, sizeof(priv->xor)) || priv->sreg != priv->dreg || priv->len != reg->len) return -EOPNOTSUPP; memcpy(®->mask, &priv->mask, sizeof(priv->mask)); return 0; } static bool nft_bitwise_reduce(struct nft_regs_track *track, const struct nft_expr *expr) { const struct nft_bitwise *priv = nft_expr_priv(expr); const struct nft_bitwise *bitwise; unsigned int regcount; u8 dreg; int i; if (!track->regs[priv->sreg].selector) return false; bitwise = nft_expr_priv(track->regs[priv->dreg].selector); if (track->regs[priv->sreg].selector == track->regs[priv->dreg].selector && track->regs[priv->sreg].num_reg == 0 && track->regs[priv->dreg].bitwise && track->regs[priv->dreg].bitwise->ops == expr->ops && priv->sreg == bitwise->sreg && priv->sreg2 == bitwise->sreg2 && priv->dreg == bitwise->dreg && priv->op == bitwise->op && priv->len == bitwise->len && !memcmp(&priv->mask, &bitwise->mask, sizeof(priv->mask)) && !memcmp(&priv->xor, &bitwise->xor, sizeof(priv->xor)) && !memcmp(&priv->data, &bitwise->data, sizeof(priv->data))) { track->cur = expr; return true; } if (track->regs[priv->sreg].bitwise || track->regs[priv->sreg].num_reg != 0) { nft_reg_track_cancel(track, priv->dreg, priv->len); return false; } if (priv->sreg != priv->dreg) { nft_reg_track_update(track, track->regs[priv->sreg].selector, priv->dreg, priv->len); } dreg = priv->dreg; regcount = DIV_ROUND_UP(priv->len, NFT_REG32_SIZE); for (i = 0; i < regcount; i++, dreg++) track->regs[dreg].bitwise = expr; return false; } static const struct nft_expr_ops nft_bitwise_ops = { .type = &nft_bitwise_type, .size = NFT_EXPR_SIZE(sizeof(struct nft_bitwise)), .eval = nft_bitwise_eval, .init = nft_bitwise_init, .dump = nft_bitwise_dump, .reduce = nft_bitwise_reduce, .offload = nft_bitwise_offload, }; static int nft_bitwise_extract_u32_data(const struct nlattr * const tb, u32 *out) { struct nft_data data; struct nft_data_desc desc = { .type = NFT_DATA_VALUE, .size = sizeof(data), .len = sizeof(u32), }; int err; err = nft_data_init(NULL, &data, &desc, tb); if (err < 0) return err; *out = data.data[0]; return 0; } static int nft_bitwise_fast_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { struct nft_bitwise_fast_expr *priv = nft_expr_priv(expr); int err; err = nft_parse_register_load(ctx, tb[NFTA_BITWISE_SREG], &priv->sreg, sizeof(u32)); if (err < 0) return err; err = nft_parse_register_store(ctx, tb[NFTA_BITWISE_DREG], &priv->dreg, NULL, NFT_DATA_VALUE, sizeof(u32)); if (err < 0) return err; if (tb[NFTA_BITWISE_DATA] || tb[NFTA_BITWISE_SREG2]) return -EINVAL; if (!tb[NFTA_BITWISE_MASK] || !tb[NFTA_BITWISE_XOR]) return -EINVAL; err = nft_bitwise_extract_u32_data(tb[NFTA_BITWISE_MASK], &priv->mask); if (err < 0) return err; err = nft_bitwise_extract_u32_data(tb[NFTA_BITWISE_XOR], &priv->xor); if (err < 0) return err; return 0; } static int nft_bitwise_fast_dump(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { const struct nft_bitwise_fast_expr *priv = nft_expr_priv(expr); struct nft_data data; if (nft_dump_register(skb, NFTA_BITWISE_SREG, priv->sreg)) return -1; if (nft_dump_register(skb, NFTA_BITWISE_DREG, priv->dreg)) return -1; if (nla_put_be32(skb, NFTA_BITWISE_LEN, htonl(sizeof(u32)))) return -1; if (nla_put_be32(skb, NFTA_BITWISE_OP, htonl(NFT_BITWISE_MASK_XOR))) return -1; data.data[0] = priv->mask; if (nft_data_dump(skb, NFTA_BITWISE_MASK, &data, NFT_DATA_VALUE, sizeof(u32)) < 0) return -1; data.data[0] = priv->xor; if (nft_data_dump(skb, NFTA_BITWISE_XOR, &data, NFT_DATA_VALUE, sizeof(u32)) < 0) return -1; return 0; } static int nft_bitwise_fast_offload(struct nft_offload_ctx *ctx, struct nft_flow_rule *flow, const struct nft_expr *expr) { const struct nft_bitwise_fast_expr *priv = nft_expr_priv(expr); struct nft_offload_reg *reg = &ctx->regs[priv->dreg]; if (priv->xor || priv->sreg != priv->dreg || reg->len != sizeof(u32)) return -EOPNOTSUPP; reg->mask.data[0] = priv->mask; return 0; } static bool nft_bitwise_fast_reduce(struct nft_regs_track *track, const struct nft_expr *expr) { const struct nft_bitwise_fast_expr *priv = nft_expr_priv(expr); const struct nft_bitwise_fast_expr *bitwise; if (!track->regs[priv->sreg].selector) return false; bitwise = nft_expr_priv(track->regs[priv->dreg].selector); if (track->regs[priv->sreg].selector == track->regs[priv->dreg].selector && track->regs[priv->dreg].bitwise && track->regs[priv->dreg].bitwise->ops == expr->ops && priv->sreg == bitwise->sreg && priv->dreg == bitwise->dreg && priv->mask == bitwise->mask && priv->xor == bitwise->xor) { track->cur = expr; return true; } if (track->regs[priv->sreg].bitwise) { nft_reg_track_cancel(track, priv->dreg, NFT_REG32_SIZE); return false; } if (priv->sreg != priv->dreg) { track->regs[priv->dreg].selector = track->regs[priv->sreg].selector; } track->regs[priv->dreg].bitwise = expr; return false; } const struct nft_expr_ops nft_bitwise_fast_ops = { .type = &nft_bitwise_type, .size = NFT_EXPR_SIZE(sizeof(struct nft_bitwise_fast_expr)), .eval = NULL, /* inlined */ .init = nft_bitwise_fast_init, .dump = nft_bitwise_fast_dump, .reduce = nft_bitwise_fast_reduce, .offload = nft_bitwise_fast_offload, }; static const struct nft_expr_ops * nft_bitwise_select_ops(const struct nft_ctx *ctx, const struct nlattr * const tb[]) { int err; u32 len; if (!tb[NFTA_BITWISE_LEN] || !tb[NFTA_BITWISE_SREG] || !tb[NFTA_BITWISE_DREG]) return ERR_PTR(-EINVAL); err = nft_parse_u32_check(tb[NFTA_BITWISE_LEN], U8_MAX, &len); if (err < 0) return ERR_PTR(err); if (len != sizeof(u32)) return &nft_bitwise_ops; if (tb[NFTA_BITWISE_OP] && ntohl(nla_get_be32(tb[NFTA_BITWISE_OP])) != NFT_BITWISE_MASK_XOR) return &nft_bitwise_ops; return &nft_bitwise_fast_ops; } struct nft_expr_type nft_bitwise_type __read_mostly = { .name = "bitwise", .select_ops = nft_bitwise_select_ops, .policy = nft_bitwise_policy, .maxattr = NFTA_BITWISE_MAX, .owner = THIS_MODULE, }; bool nft_expr_reduce_bitwise(struct nft_regs_track *track, const struct nft_expr *expr) { const struct nft_expr *last = track->last; const struct nft_expr *next; if (expr == last) return false; next = nft_expr_next(expr); if (next->ops == &nft_bitwise_ops) return nft_bitwise_reduce(track, next); else if (next->ops == &nft_bitwise_fast_ops) return nft_bitwise_fast_reduce(track, next); return false; } EXPORT_SYMBOL_GPL(nft_expr_reduce_bitwise); |
| 523 524 30 266 237 266 216 79 40 79 79 79 79 79 79 79 29 18 11 540 23 619 12 25 81 128 39 46 47 39 39 42 42 18 1 2 16 6 8 3 11 2 20 18 36 35 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 | // SPDX-License-Identifier: GPL-2.0 /* * High-level sync()-related operations */ #include <linux/blkdev.h> #include <linux/kernel.h> #include <linux/file.h> #include <linux/fs.h> #include <linux/slab.h> #include <linux/export.h> #include <linux/namei.h> #include <linux/sched.h> #include <linux/writeback.h> #include <linux/syscalls.h> #include <linux/linkage.h> #include <linux/pagemap.h> #include <linux/quotaops.h> #include <linux/backing-dev.h> #include "internal.h" #define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \ SYNC_FILE_RANGE_WAIT_AFTER) /* * Write out and wait upon all dirty data associated with this * superblock. Filesystem data as well as the underlying block * device. Takes the superblock lock. */ int sync_filesystem(struct super_block *sb) { int ret = 0; /* * We need to be protected against the filesystem going from * r/o to r/w or vice versa. */ WARN_ON(!rwsem_is_locked(&sb->s_umount)); /* * No point in syncing out anything if the filesystem is read-only. */ if (sb_rdonly(sb)) return 0; /* * Do the filesystem syncing work. For simple filesystems * writeback_inodes_sb(sb) just dirties buffers with inodes so we have * to submit I/O for these buffers via sync_blockdev(). This also * speeds up the wait == 1 case since in that case write_inode() * methods call sync_dirty_buffer() and thus effectively write one block * at a time. */ writeback_inodes_sb(sb, WB_REASON_SYNC); if (sb->s_op->sync_fs) { ret = sb->s_op->sync_fs(sb, 0); if (ret) return ret; } ret = sync_blockdev_nowait(sb->s_bdev); if (ret) return ret; sync_inodes_sb(sb); if (sb->s_op->sync_fs) { ret = sb->s_op->sync_fs(sb, 1); if (ret) return ret; } return sync_blockdev(sb->s_bdev); } EXPORT_SYMBOL(sync_filesystem); static void sync_inodes_one_sb(struct super_block *sb, void *arg) { if (!sb_rdonly(sb)) sync_inodes_sb(sb); } static void sync_fs_one_sb(struct super_block *sb, void *arg) { if (!sb_rdonly(sb) && !(sb->s_iflags & SB_I_SKIP_SYNC) && sb->s_op->sync_fs) sb->s_op->sync_fs(sb, *(int *)arg); } /* * Sync everything. We start by waking flusher threads so that most of * writeback runs on all devices in parallel. Then we sync all inodes reliably * which effectively also waits for all flusher threads to finish doing * writeback. At this point all data is on disk so metadata should be stable * and we tell filesystems to sync their metadata via ->sync_fs() calls. * Finally, we writeout all block devices because some filesystems (e.g. ext2) * just write metadata (such as inodes or bitmaps) to block device page cache * and do not sync it on their own in ->sync_fs(). */ void ksys_sync(void) { int nowait = 0, wait = 1; wakeup_flusher_threads(WB_REASON_SYNC); iterate_supers(sync_inodes_one_sb, NULL); iterate_supers(sync_fs_one_sb, &nowait); iterate_supers(sync_fs_one_sb, &wait); sync_bdevs(false); sync_bdevs(true); if (unlikely(laptop_mode)) laptop_sync_completion(); } SYSCALL_DEFINE0(sync) { ksys_sync(); return 0; } static void do_sync_work(struct work_struct *work) { int nowait = 0; /* * Sync twice to reduce the possibility we skipped some inodes / pages * because they were temporarily locked */ iterate_supers(sync_inodes_one_sb, &nowait); iterate_supers(sync_fs_one_sb, &nowait); sync_bdevs(false); iterate_supers(sync_inodes_one_sb, &nowait); iterate_supers(sync_fs_one_sb, &nowait); sync_bdevs(false); printk("Emergency Sync complete\n"); kfree(work); } void emergency_sync(void) { struct work_struct *work; work = kmalloc(sizeof(*work), GFP_ATOMIC); if (work) { INIT_WORK(work, do_sync_work); schedule_work(work); } } /* * sync a single super */ SYSCALL_DEFINE1(syncfs, int, fd) { CLASS(fd, f)(fd); struct super_block *sb; int ret, ret2; if (fd_empty(f)) return -EBADF; sb = fd_file(f)->f_path.dentry->d_sb; down_read(&sb->s_umount); ret = sync_filesystem(sb); up_read(&sb->s_umount); ret2 = errseq_check_and_advance(&sb->s_wb_err, &fd_file(f)->f_sb_err); return ret ? ret : ret2; } /** * vfs_fsync_range - helper to sync a range of data & metadata to disk * @file: file to sync * @start: offset in bytes of the beginning of data range to sync * @end: offset in bytes of the end of data range (inclusive) * @datasync: perform only datasync * * Write back data in range @start..@end and metadata for @file to disk. If * @datasync is set only metadata needed to access modified file data is * written. */ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; if (!file->f_op->fsync) return -EINVAL; if (!datasync && (inode->i_state & I_DIRTY_TIME)) mark_inode_dirty_sync(inode); return file->f_op->fsync(file, start, end, datasync); } EXPORT_SYMBOL(vfs_fsync_range); /** * vfs_fsync - perform a fsync or fdatasync on a file * @file: file to sync * @datasync: only perform a fdatasync operation * * Write back data and metadata for @file to disk. If @datasync is * set only metadata needed to access modified file data is written. */ int vfs_fsync(struct file *file, int datasync) { return vfs_fsync_range(file, 0, LLONG_MAX, datasync); } EXPORT_SYMBOL(vfs_fsync); static int do_fsync(unsigned int fd, int datasync) { CLASS(fd, f)(fd); if (fd_empty(f)) return -EBADF; return vfs_fsync(fd_file(f), datasync); } SYSCALL_DEFINE1(fsync, unsigned int, fd) { return do_fsync(fd, 0); } SYSCALL_DEFINE1(fdatasync, unsigned int, fd) { return do_fsync(fd, 1); } int sync_file_range(struct file *file, loff_t offset, loff_t nbytes, unsigned int flags) { int ret; struct address_space *mapping; loff_t endbyte; /* inclusive */ umode_t i_mode; ret = -EINVAL; if (flags & ~VALID_FLAGS) goto out; endbyte = offset + nbytes; if ((s64)offset < 0) goto out; if ((s64)endbyte < 0) goto out; if (endbyte < offset) goto out; if (sizeof(pgoff_t) == 4) { if (offset >= (0x100000000ULL << PAGE_SHIFT)) { /* * The range starts outside a 32 bit machine's * pagecache addressing capabilities. Let it "succeed" */ ret = 0; goto out; } if (endbyte >= (0x100000000ULL << PAGE_SHIFT)) { /* * Out to EOF */ nbytes = 0; } } if (nbytes == 0) endbyte = LLONG_MAX; else endbyte--; /* inclusive */ i_mode = file_inode(file)->i_mode; ret = -ESPIPE; if (!S_ISREG(i_mode) && !S_ISBLK(i_mode) && !S_ISDIR(i_mode) && !S_ISLNK(i_mode)) goto out; mapping = file->f_mapping; ret = 0; if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) { ret = file_fdatawait_range(file, offset, endbyte); if (ret < 0) goto out; } if (flags & SYNC_FILE_RANGE_WRITE) { int sync_mode = WB_SYNC_NONE; if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) == SYNC_FILE_RANGE_WRITE_AND_WAIT) sync_mode = WB_SYNC_ALL; ret = __filemap_fdatawrite_range(mapping, offset, endbyte, sync_mode); if (ret < 0) goto out; } if (flags & SYNC_FILE_RANGE_WAIT_AFTER) ret = file_fdatawait_range(file, offset, endbyte); out: return ret; } /* * ksys_sync_file_range() permits finely controlled syncing over a segment of * a file in the range offset .. (offset+nbytes-1) inclusive. If nbytes is * zero then ksys_sync_file_range() will operate from offset out to EOF. * * The flag bits are: * * SYNC_FILE_RANGE_WAIT_BEFORE: wait upon writeout of all pages in the range * before performing the write. * * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the * range which are not presently under writeback. Note that this may block for * significant periods due to exhaustion of disk request structures. * * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range * after performing the write. * * Useful combinations of the flag bits are: * * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages * in the range which were dirty on entry to ksys_sync_file_range() are placed * under writeout. This is a start-write-for-data-integrity operation. * * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which * are not presently under writeout. This is an asynchronous flush-to-disk * operation. Not suitable for data integrity operations. * * SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER): wait for * completion of writeout of all pages in the range. This will be used after an * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait * for that operation to complete and to return the result. * * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT): * a traditional sync() operation. This is a write-for-data-integrity operation * which will ensure that all pages in the range which were dirty on entry to * ksys_sync_file_range() are written to disk. It should be noted that disk * caches are not flushed by this call, so there are no guarantees here that the * data will be available on disk after a crash. * * * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any * I/O errors or ENOSPC conditions and will return those to the caller, after * clearing the EIO and ENOSPC flags in the address_space. * * It should be noted that none of these operations write out the file's * metadata. So unless the application is strictly performing overwrites of * already-instantiated disk blocks, there are no guarantees here that the data * will be available after a crash. */ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, unsigned int flags) { CLASS(fd, f)(fd); if (fd_empty(f)) return -EBADF; return sync_file_range(fd_file(f), offset, nbytes, flags); } SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes, unsigned int, flags) { return ksys_sync_file_range(fd, offset, nbytes, flags); } #if defined(CONFIG_COMPAT) && defined(__ARCH_WANT_COMPAT_SYNC_FILE_RANGE) COMPAT_SYSCALL_DEFINE6(sync_file_range, int, fd, compat_arg_u64_dual(offset), compat_arg_u64_dual(nbytes), unsigned int, flags) { return ksys_sync_file_range(fd, compat_arg_u64_glue(offset), compat_arg_u64_glue(nbytes), flags); } #endif /* It would be nice if people remember that not all the world's an i386 when they introduce new system calls */ SYSCALL_DEFINE4(sync_file_range2, int, fd, unsigned int, flags, loff_t, offset, loff_t, nbytes) { return ksys_sync_file_range(fd, offset, nbytes, flags); } |
| 430 1 430 430 1 430 1 1 3537 3539 428 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 2 2 2 2 1 1 1 2 2 1 2 2 2 2 2 2 1 1 4717 4722 2682 2682 1647 261 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 | // SPDX-License-Identifier: GPL-2.0-only #include <linux/netdevice.h> #include <linux/notifier.h> #include <linux/rtnetlink.h> #include <net/busy_poll.h> #include <net/net_namespace.h> #include <net/netdev_queues.h> #include <net/netdev_rx_queue.h> #include <net/sock.h> #include <net/xdp.h> #include <net/xdp_sock.h> #include <net/page_pool/memory_provider.h> #include "dev.h" #include "devmem.h" #include "netdev-genl-gen.h" struct netdev_nl_dump_ctx { unsigned long ifindex; unsigned int rxq_idx; unsigned int txq_idx; unsigned int napi_id; }; static struct netdev_nl_dump_ctx *netdev_dump_ctx(struct netlink_callback *cb) { NL_ASSERT_CTX_FITS(struct netdev_nl_dump_ctx); return (struct netdev_nl_dump_ctx *)cb->ctx; } static int netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp, const struct genl_info *info) { u64 xsk_features = 0; u64 xdp_rx_meta = 0; void *hdr; netdev_assert_locked(netdev); /* note: rtnl_lock may not be held! */ hdr = genlmsg_iput(rsp, info); if (!hdr) return -EMSGSIZE; #define XDP_METADATA_KFUNC(_, flag, __, xmo) \ if (netdev->xdp_metadata_ops && netdev->xdp_metadata_ops->xmo) \ xdp_rx_meta |= flag; XDP_METADATA_KFUNC_xxx #undef XDP_METADATA_KFUNC if (netdev->xsk_tx_metadata_ops) { if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp) xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP; if (netdev->xsk_tx_metadata_ops->tmo_request_checksum) xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM; if (netdev->xsk_tx_metadata_ops->tmo_request_launch_time) xsk_features |= NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO; } if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) || nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES, netdev->xdp_features, NETDEV_A_DEV_PAD) || nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_RX_METADATA_FEATURES, xdp_rx_meta, NETDEV_A_DEV_PAD) || nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES, xsk_features, NETDEV_A_DEV_PAD)) goto err_cancel_msg; if (netdev->xdp_features & NETDEV_XDP_ACT_XSK_ZEROCOPY) { if (nla_put_u32(rsp, NETDEV_A_DEV_XDP_ZC_MAX_SEGS, netdev->xdp_zc_max_segs)) goto err_cancel_msg; } genlmsg_end(rsp, hdr); return 0; err_cancel_msg: genlmsg_cancel(rsp, hdr); return -EMSGSIZE; } static void netdev_genl_dev_notify(struct net_device *netdev, int cmd) { struct genl_info info; struct sk_buff *ntf; if (!genl_has_listeners(&netdev_nl_family, dev_net(netdev), NETDEV_NLGRP_MGMT)) return; genl_info_init_ntf(&info, &netdev_nl_family, cmd); ntf = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!ntf) return; if (netdev_nl_dev_fill(netdev, ntf, &info)) { nlmsg_free(ntf); return; } genlmsg_multicast_netns(&netdev_nl_family, dev_net(netdev), ntf, 0, NETDEV_NLGRP_MGMT, GFP_KERNEL); } int netdev_nl_dev_get_doit(struct sk_buff *skb, struct genl_info *info) { struct net_device *netdev; struct sk_buff *rsp; u32 ifindex; int err; if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX)) return -EINVAL; ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!rsp) return -ENOMEM; netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex); if (!netdev) { err = -ENODEV; goto err_free_msg; } err = netdev_nl_dev_fill(netdev, rsp, info); netdev_unlock(netdev); if (err) goto err_free_msg; return genlmsg_reply(rsp, info); err_free_msg: nlmsg_free(rsp); return err; } int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb); struct net *net = sock_net(skb->sk); int err; for_each_netdev_lock_scoped(net, netdev, ctx->ifindex) { err = netdev_nl_dev_fill(netdev, skb, genl_info_dump(cb)); if (err < 0) return err; } return 0; } static int netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi, const struct genl_info *info) { unsigned long irq_suspend_timeout; unsigned long gro_flush_timeout; u32 napi_defer_hard_irqs; void *hdr; pid_t pid; if (!napi->dev->up) return 0; hdr = genlmsg_iput(rsp, info); if (!hdr) return -EMSGSIZE; if (nla_put_u32(rsp, NETDEV_A_NAPI_ID, napi->napi_id)) goto nla_put_failure; if (nla_put_u32(rsp, NETDEV_A_NAPI_IFINDEX, napi->dev->ifindex)) goto nla_put_failure; if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq)) goto nla_put_failure; if (nla_put_uint(rsp, NETDEV_A_NAPI_THREADED, napi_get_threaded(napi))) goto nla_put_failure; if (napi->thread) { pid = task_pid_nr(napi->thread); if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid)) goto nla_put_failure; } napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi); if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs)) goto nla_put_failure; irq_suspend_timeout = napi_get_irq_suspend_timeout(napi); if (nla_put_uint(rsp, NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT, irq_suspend_timeout)) goto nla_put_failure; gro_flush_timeout = napi_get_gro_flush_timeout(napi); if (nla_put_uint(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT, gro_flush_timeout)) goto nla_put_failure; genlmsg_end(rsp, hdr); return 0; nla_put_failure: genlmsg_cancel(rsp, hdr); return -EMSGSIZE; } int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info) { struct napi_struct *napi; struct sk_buff *rsp; u32 napi_id; int err; if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID)) return -EINVAL; napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]); rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!rsp) return -ENOMEM; napi = netdev_napi_by_id_lock(genl_info_net(info), napi_id); if (napi) { err = netdev_nl_napi_fill_one(rsp, napi, info); netdev_unlock(napi->dev); } else { NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]); err = -ENOENT; } if (err) { goto err_free_msg; } else if (!rsp->len) { err = -ENOENT; goto err_free_msg; } return genlmsg_reply(rsp, info); err_free_msg: nlmsg_free(rsp); return err; } static int netdev_nl_napi_dump_one(struct net_device *netdev, struct sk_buff *rsp, const struct genl_info *info, struct netdev_nl_dump_ctx *ctx) { struct napi_struct *napi; unsigned int prev_id; int err = 0; if (!netdev->up) return err; prev_id = UINT_MAX; list_for_each_entry(napi, &netdev->napi_list, dev_list) { if (!napi_id_valid(napi->napi_id)) continue; /* Dump continuation below depends on the list being sorted */ WARN_ON_ONCE(napi->napi_id >= prev_id); prev_id = napi->napi_id; if (ctx->napi_id && napi->napi_id >= ctx->napi_id) continue; err = netdev_nl_napi_fill_one(rsp, napi, info); if (err) return err; ctx->napi_id = napi->napi_id; } return err; } int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb); const struct genl_info *info = genl_info_dump(cb); struct net *net = sock_net(skb->sk); struct net_device *netdev; u32 ifindex = 0; int err = 0; if (info->attrs[NETDEV_A_NAPI_IFINDEX]) ifindex = nla_get_u32(info->attrs[NETDEV_A_NAPI_IFINDEX]); if (ifindex) { netdev = netdev_get_by_index_lock(net, ifindex); if (netdev) { err = netdev_nl_napi_dump_one(netdev, skb, info, ctx); netdev_unlock(netdev); } else { err = -ENODEV; } } else { for_each_netdev_lock_scoped(net, netdev, ctx->ifindex) { err = netdev_nl_napi_dump_one(netdev, skb, info, ctx); if (err < 0) break; ctx->napi_id = 0; } } return err; } static int netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info) { u64 irq_suspend_timeout = 0; u64 gro_flush_timeout = 0; u8 threaded = 0; u32 defer = 0; if (info->attrs[NETDEV_A_NAPI_THREADED]) { int ret; threaded = nla_get_uint(info->attrs[NETDEV_A_NAPI_THREADED]); ret = napi_set_threaded(napi, threaded); if (ret) return ret; } if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) { defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]); napi_set_defer_hard_irqs(napi, defer); } if (info->attrs[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT]) { irq_suspend_timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT]); napi_set_irq_suspend_timeout(napi, irq_suspend_timeout); } if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) { gro_flush_timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]); napi_set_gro_flush_timeout(napi, gro_flush_timeout); } return 0; } int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info) { struct napi_struct *napi; unsigned int napi_id; int err; if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID)) return -EINVAL; napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]); napi = netdev_napi_by_id_lock(genl_info_net(info), napi_id); if (napi) { err = netdev_nl_napi_set_config(napi, info); netdev_unlock(napi->dev); } else { NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]); err = -ENOENT; } return err; } static int nla_put_napi_id(struct sk_buff *skb, const struct napi_struct *napi) { if (napi && napi_id_valid(napi->napi_id)) return nla_put_u32(skb, NETDEV_A_QUEUE_NAPI_ID, napi->napi_id); return 0; } static int netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev, u32 q_idx, u32 q_type, const struct genl_info *info) { struct pp_memory_provider_params *params; struct netdev_rx_queue *rxq; struct netdev_queue *txq; void *hdr; hdr = genlmsg_iput(rsp, info); if (!hdr) return -EMSGSIZE; if (nla_put_u32(rsp, NETDEV_A_QUEUE_ID, q_idx) || nla_put_u32(rsp, NETDEV_A_QUEUE_TYPE, q_type) || nla_put_u32(rsp, NETDEV_A_QUEUE_IFINDEX, netdev->ifindex)) goto nla_put_failure; switch (q_type) { case NETDEV_QUEUE_TYPE_RX: rxq = __netif_get_rx_queue(netdev, q_idx); if (nla_put_napi_id(rsp, rxq->napi)) goto nla_put_failure; params = &rxq->mp_params; if (params->mp_ops && params->mp_ops->nl_fill(params->mp_priv, rsp, rxq)) goto nla_put_failure; #ifdef CONFIG_XDP_SOCKETS if (rxq->pool) if (nla_put_empty_nest(rsp, NETDEV_A_QUEUE_XSK)) goto nla_put_failure; #endif break; case NETDEV_QUEUE_TYPE_TX: txq = netdev_get_tx_queue(netdev, q_idx); if (nla_put_napi_id(rsp, txq->napi)) goto nla_put_failure; #ifdef CONFIG_XDP_SOCKETS if (txq->pool) if (nla_put_empty_nest(rsp, NETDEV_A_QUEUE_XSK)) goto nla_put_failure; #endif break; } genlmsg_end(rsp, hdr); return 0; nla_put_failure: genlmsg_cancel(rsp, hdr); return -EMSGSIZE; } static int netdev_nl_queue_validate(struct net_device *netdev, u32 q_id, u32 q_type) { switch (q_type) { case NETDEV_QUEUE_TYPE_RX: if (q_id >= netdev->real_num_rx_queues) return -EINVAL; return 0; case NETDEV_QUEUE_TYPE_TX: if (q_id >= netdev->real_num_tx_queues) return -EINVAL; } return 0; } static int netdev_nl_queue_fill(struct sk_buff *rsp, struct net_device *netdev, u32 q_idx, u32 q_type, const struct genl_info *info) { int err; if (!netdev->up) return -ENOENT; err = netdev_nl_queue_validate(netdev, q_idx, q_type); if (err) return err; return netdev_nl_queue_fill_one(rsp, netdev, q_idx, q_type, info); } int netdev_nl_queue_get_doit(struct sk_buff *skb, struct genl_info *info) { u32 q_id, q_type, ifindex; struct net_device *netdev; struct sk_buff *rsp; int err; if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_ID) || GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_TYPE) || GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_IFINDEX)) return -EINVAL; q_id = nla_get_u32(info->attrs[NETDEV_A_QUEUE_ID]); q_type = nla_get_u32(info->attrs[NETDEV_A_QUEUE_TYPE]); ifindex = nla_get_u32(info->attrs[NETDEV_A_QUEUE_IFINDEX]); rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!rsp) return -ENOMEM; netdev = netdev_get_by_index_lock_ops_compat(genl_info_net(info), ifindex); if (netdev) { err = netdev_nl_queue_fill(rsp, netdev, q_id, q_type, info); netdev_unlock_ops_compat(netdev); } else { err = -ENODEV; } if (err) goto err_free_msg; return genlmsg_reply(rsp, info); err_free_msg: nlmsg_free(rsp); return err; } static int netdev_nl_queue_dump_one(struct net_device *netdev, struct sk_buff *rsp, const struct genl_info *info, struct netdev_nl_dump_ctx *ctx) { int err = 0; if (!netdev->up) return err; for (; ctx->rxq_idx < netdev->real_num_rx_queues; ctx->rxq_idx++) { err = netdev_nl_queue_fill_one(rsp, netdev, ctx->rxq_idx, NETDEV_QUEUE_TYPE_RX, info); if (err) return err; } for (; ctx->txq_idx < netdev->real_num_tx_queues; ctx->txq_idx++) { err = netdev_nl_queue_fill_one(rsp, netdev, ctx->txq_idx, NETDEV_QUEUE_TYPE_TX, info); if (err) return err; } return err; } int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb); const struct genl_info *info = genl_info_dump(cb); struct net *net = sock_net(skb->sk); struct net_device *netdev; u32 ifindex = 0; int err = 0; if (info->attrs[NETDEV_A_QUEUE_IFINDEX]) ifindex = nla_get_u32(info->attrs[NETDEV_A_QUEUE_IFINDEX]); if (ifindex) { netdev = netdev_get_by_index_lock_ops_compat(net, ifindex); if (netdev) { err = netdev_nl_queue_dump_one(netdev, skb, info, ctx); netdev_unlock_ops_compat(netdev); } else { err = -ENODEV; } } else { for_each_netdev_lock_ops_compat_scoped(net, netdev, ctx->ifindex) { err = netdev_nl_queue_dump_one(netdev, skb, info, ctx); if (err < 0) break; ctx->rxq_idx = 0; ctx->txq_idx = 0; } } return err; } #define NETDEV_STAT_NOT_SET (~0ULL) static void netdev_nl_stats_add(void *_sum, const void *_add, size_t size) { const u64 *add = _add; u64 *sum = _sum; while (size) { if (*add != NETDEV_STAT_NOT_SET && *sum != NETDEV_STAT_NOT_SET) *sum += *add; sum++; add++; size -= 8; } } static int netdev_stat_put(struct sk_buff *rsp, unsigned int attr_id, u64 value) { if (value == NETDEV_STAT_NOT_SET) return 0; return nla_put_uint(rsp, attr_id, value); } static int netdev_nl_stats_write_rx(struct sk_buff *rsp, struct netdev_queue_stats_rx *rx) { if (netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_PACKETS, rx->packets) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_BYTES, rx->bytes) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_ALLOC_FAIL, rx->alloc_fail) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_HW_DROPS, rx->hw_drops) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_HW_DROP_OVERRUNS, rx->hw_drop_overruns) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_CSUM_COMPLETE, rx->csum_complete) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_CSUM_UNNECESSARY, rx->csum_unnecessary) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_CSUM_NONE, rx->csum_none) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_CSUM_BAD, rx->csum_bad) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_HW_GRO_PACKETS, rx->hw_gro_packets) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_HW_GRO_BYTES, rx->hw_gro_bytes) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_HW_GRO_WIRE_PACKETS, rx->hw_gro_wire_packets) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_HW_GRO_WIRE_BYTES, rx->hw_gro_wire_bytes) || netdev_stat_put(rsp, NETDEV_A_QSTATS_RX_HW_DROP_RATELIMITS, rx->hw_drop_ratelimits)) return -EMSGSIZE; return 0; } static int netdev_nl_stats_write_tx(struct sk_buff *rsp, struct netdev_queue_stats_tx *tx) { if (netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_PACKETS, tx->packets) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_BYTES, tx->bytes) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_HW_DROPS, tx->hw_drops) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_HW_DROP_ERRORS, tx->hw_drop_errors) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_CSUM_NONE, tx->csum_none) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_NEEDS_CSUM, tx->needs_csum) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_HW_GSO_PACKETS, tx->hw_gso_packets) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_HW_GSO_BYTES, tx->hw_gso_bytes) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_HW_GSO_WIRE_PACKETS, tx->hw_gso_wire_packets) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_HW_GSO_WIRE_BYTES, tx->hw_gso_wire_bytes) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_HW_DROP_RATELIMITS, tx->hw_drop_ratelimits) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_STOP, tx->stop) || netdev_stat_put(rsp, NETDEV_A_QSTATS_TX_WAKE, tx->wake)) return -EMSGSIZE; return 0; } static int netdev_nl_stats_queue(struct net_device *netdev, struct sk_buff *rsp, u32 q_type, int i, const struct genl_info *info) { const struct netdev_stat_ops *ops = netdev->stat_ops; struct netdev_queue_stats_rx rx; struct netdev_queue_stats_tx tx; void *hdr; hdr = genlmsg_iput(rsp, info); if (!hdr) return -EMSGSIZE; if (nla_put_u32(rsp, NETDEV_A_QSTATS_IFINDEX, netdev->ifindex) || nla_put_u32(rsp, NETDEV_A_QSTATS_QUEUE_TYPE, q_type) || nla_put_u32(rsp, NETDEV_A_QSTATS_QUEUE_ID, i)) goto nla_put_failure; switch (q_type) { case NETDEV_QUEUE_TYPE_RX: memset(&rx, 0xff, sizeof(rx)); ops->get_queue_stats_rx(netdev, i, &rx); if (!memchr_inv(&rx, 0xff, sizeof(rx))) goto nla_cancel; if (netdev_nl_stats_write_rx(rsp, &rx)) goto nla_put_failure; break; case NETDEV_QUEUE_TYPE_TX: memset(&tx, 0xff, sizeof(tx)); ops->get_queue_stats_tx(netdev, i, &tx); if (!memchr_inv(&tx, 0xff, sizeof(tx))) goto nla_cancel; if (netdev_nl_stats_write_tx(rsp, &tx)) goto nla_put_failure; break; } genlmsg_end(rsp, hdr); return 0; nla_cancel: genlmsg_cancel(rsp, hdr); return 0; nla_put_failure: genlmsg_cancel(rsp, hdr); return -EMSGSIZE; } static int netdev_nl_stats_by_queue(struct net_device *netdev, struct sk_buff *rsp, const struct genl_info *info, struct netdev_nl_dump_ctx *ctx) { const struct netdev_stat_ops *ops = netdev->stat_ops; int i, err; if (!(netdev->flags & IFF_UP)) return 0; i = ctx->rxq_idx; while (ops->get_queue_stats_rx && i < netdev->real_num_rx_queues) { err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_RX, i, info); if (err) return err; ctx->rxq_idx = ++i; } i = ctx->txq_idx; while (ops->get_queue_stats_tx && i < netdev->real_num_tx_queues) { err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_TX, i, info); if (err) return err; ctx->txq_idx = ++i; } ctx->rxq_idx = 0; ctx->txq_idx = 0; return 0; } /** * netdev_stat_queue_sum() - add up queue stats from range of queues * @netdev: net_device * @rx_start: index of the first Rx queue to query * @rx_end: index after the last Rx queue (first *not* to query) * @rx_sum: output Rx stats, should be already initialized * @tx_start: index of the first Tx queue to query * @tx_end: index after the last Tx queue (first *not* to query) * @tx_sum: output Tx stats, should be already initialized * * Add stats from [start, end) range of queue IDs to *x_sum structs. * The sum structs must be already initialized. Usually this * helper is invoked from the .get_base_stats callbacks of drivers * to account for stats of disabled queues. In that case the ranges * are usually [netdev->real_num_*x_queues, netdev->num_*x_queues). */ void netdev_stat_queue_sum(struct net_device *netdev, int rx_start, int rx_end, struct netdev_queue_stats_rx *rx_sum, int tx_start, int tx_end, struct netdev_queue_stats_tx *tx_sum) { const struct netdev_stat_ops *ops; struct netdev_queue_stats_rx rx; struct netdev_queue_stats_tx tx; int i; ops = netdev->stat_ops; for (i = rx_start; i < rx_end; i++) { memset(&rx, 0xff, sizeof(rx)); if (ops->get_queue_stats_rx) ops->get_queue_stats_rx(netdev, i, &rx); netdev_nl_stats_add(rx_sum, &rx, sizeof(rx)); } for (i = tx_start; i < tx_end; i++) { memset(&tx, 0xff, sizeof(tx)); if (ops->get_queue_stats_tx) ops->get_queue_stats_tx(netdev, i, &tx); netdev_nl_stats_add(tx_sum, &tx, sizeof(tx)); } } EXPORT_SYMBOL(netdev_stat_queue_sum); static int netdev_nl_stats_by_netdev(struct net_device *netdev, struct sk_buff *rsp, const struct genl_info *info) { struct netdev_queue_stats_rx rx_sum; struct netdev_queue_stats_tx tx_sum; void *hdr; /* Netdev can't guarantee any complete counters */ if (!netdev->stat_ops->get_base_stats) return 0; memset(&rx_sum, 0xff, sizeof(rx_sum)); memset(&tx_sum, 0xff, sizeof(tx_sum)); netdev->stat_ops->get_base_stats(netdev, &rx_sum, &tx_sum); /* The op was there, but nothing reported, don't bother */ if (!memchr_inv(&rx_sum, 0xff, sizeof(rx_sum)) && !memchr_inv(&tx_sum, 0xff, sizeof(tx_sum))) return 0; hdr = genlmsg_iput(rsp, info); if (!hdr) return -EMSGSIZE; if (nla_put_u32(rsp, NETDEV_A_QSTATS_IFINDEX, netdev->ifindex)) goto nla_put_failure; netdev_stat_queue_sum(netdev, 0, netdev->real_num_rx_queues, &rx_sum, 0, netdev->real_num_tx_queues, &tx_sum); if (netdev_nl_stats_write_rx(rsp, &rx_sum) || netdev_nl_stats_write_tx(rsp, &tx_sum)) goto nla_put_failure; genlmsg_end(rsp, hdr); return 0; nla_put_failure: genlmsg_cancel(rsp, hdr); return -EMSGSIZE; } static int netdev_nl_qstats_get_dump_one(struct net_device *netdev, unsigned int scope, struct sk_buff *skb, const struct genl_info *info, struct netdev_nl_dump_ctx *ctx) { if (!netdev->stat_ops) return 0; switch (scope) { case 0: return netdev_nl_stats_by_netdev(netdev, skb, info); case NETDEV_QSTATS_SCOPE_QUEUE: return netdev_nl_stats_by_queue(netdev, skb, info, ctx); } return -EINVAL; /* Should not happen, per netlink policy */ } int netdev_nl_qstats_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb); const struct genl_info *info = genl_info_dump(cb); struct net *net = sock_net(skb->sk); struct net_device *netdev; unsigned int ifindex; unsigned int scope; int err = 0; scope = 0; if (info->attrs[NETDEV_A_QSTATS_SCOPE]) scope = nla_get_uint(info->attrs[NETDEV_A_QSTATS_SCOPE]); ifindex = 0; if (info->attrs[NETDEV_A_QSTATS_IFINDEX]) ifindex = nla_get_u32(info->attrs[NETDEV_A_QSTATS_IFINDEX]); if (ifindex) { netdev = netdev_get_by_index_lock_ops_compat(net, ifindex); if (!netdev) { NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_QSTATS_IFINDEX]); return -ENODEV; } if (netdev->stat_ops) { err = netdev_nl_qstats_get_dump_one(netdev, scope, skb, info, ctx); } else { NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_QSTATS_IFINDEX]); err = -EOPNOTSUPP; } netdev_unlock_ops_compat(netdev); return err; } for_each_netdev_lock_ops_compat_scoped(net, netdev, ctx->ifindex) { err = netdev_nl_qstats_get_dump_one(netdev, scope, skb, info, ctx); if (err < 0) break; } return err; } int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) { struct nlattr *tb[ARRAY_SIZE(netdev_queue_id_nl_policy)]; struct net_devmem_dmabuf_binding *binding; u32 ifindex, dmabuf_fd, rxq_idx; struct netdev_nl_sock *priv; struct net_device *netdev; struct sk_buff *rsp; struct nlattr *attr; int rem, err = 0; void *hdr; if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) || GENL_REQ_ATTR_CHECK(info, NETDEV_A_DMABUF_FD) || GENL_REQ_ATTR_CHECK(info, NETDEV_A_DMABUF_QUEUES)) return -EINVAL; ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]); priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk); if (IS_ERR(priv)) return PTR_ERR(priv); rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!rsp) return -ENOMEM; hdr = genlmsg_iput(rsp, info); if (!hdr) { err = -EMSGSIZE; goto err_genlmsg_free; } mutex_lock(&priv->lock); err = 0; netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex); if (!netdev) { err = -ENODEV; goto err_unlock_sock; } if (!netif_device_present(netdev)) err = -ENODEV; else if (!netdev_need_ops_lock(netdev)) err = -EOPNOTSUPP; if (err) { NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_DEV_IFINDEX]); goto err_unlock; } binding = net_devmem_bind_dmabuf(netdev, DMA_FROM_DEVICE, dmabuf_fd, priv, info->extack); if (IS_ERR(binding)) { err = PTR_ERR(binding); goto err_unlock; } nla_for_each_attr_type(attr, NETDEV_A_DMABUF_QUEUES, genlmsg_data(info->genlhdr), genlmsg_len(info->genlhdr), rem) { err = nla_parse_nested( tb, ARRAY_SIZE(netdev_queue_id_nl_policy) - 1, attr, netdev_queue_id_nl_policy, info->extack); if (err < 0) goto err_unbind; if (NL_REQ_ATTR_CHECK(info->extack, attr, tb, NETDEV_A_QUEUE_ID) || NL_REQ_ATTR_CHECK(info->extack, attr, tb, NETDEV_A_QUEUE_TYPE)) { err = -EINVAL; goto err_unbind; } if (nla_get_u32(tb[NETDEV_A_QUEUE_TYPE]) != NETDEV_QUEUE_TYPE_RX) { NL_SET_BAD_ATTR(info->extack, tb[NETDEV_A_QUEUE_TYPE]); err = -EINVAL; goto err_unbind; } rxq_idx = nla_get_u32(tb[NETDEV_A_QUEUE_ID]); err = net_devmem_bind_dmabuf_to_queue(netdev, rxq_idx, binding, info->extack); if (err) goto err_unbind; } nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id); genlmsg_end(rsp, hdr); err = genlmsg_reply(rsp, info); if (err) goto err_unbind; netdev_unlock(netdev); mutex_unlock(&priv->lock); return 0; err_unbind: net_devmem_unbind_dmabuf(binding); err_unlock: netdev_unlock(netdev); err_unlock_sock: mutex_unlock(&priv->lock); err_genlmsg_free: nlmsg_free(rsp); return err; } int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) { struct net_devmem_dmabuf_binding *binding; struct netdev_nl_sock *priv; struct net_device *netdev; u32 ifindex, dmabuf_fd; struct sk_buff *rsp; int err = 0; void *hdr; if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) || GENL_REQ_ATTR_CHECK(info, NETDEV_A_DMABUF_FD)) return -EINVAL; ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]); priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk); if (IS_ERR(priv)) return PTR_ERR(priv); rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!rsp) return -ENOMEM; hdr = genlmsg_iput(rsp, info); if (!hdr) { err = -EMSGSIZE; goto err_genlmsg_free; } mutex_lock(&priv->lock); netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex); if (!netdev) { err = -ENODEV; goto err_unlock_sock; } if (!netif_device_present(netdev)) { err = -ENODEV; goto err_unlock_netdev; } if (!netdev->netmem_tx) { err = -EOPNOTSUPP; NL_SET_ERR_MSG(info->extack, "Driver does not support netmem TX"); goto err_unlock_netdev; } binding = net_devmem_bind_dmabuf(netdev, DMA_TO_DEVICE, dmabuf_fd, priv, info->extack); if (IS_ERR(binding)) { err = PTR_ERR(binding); goto err_unlock_netdev; } nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id); genlmsg_end(rsp, hdr); netdev_unlock(netdev); mutex_unlock(&priv->lock); return genlmsg_reply(rsp, info); err_unlock_netdev: netdev_unlock(netdev); err_unlock_sock: mutex_unlock(&priv->lock); err_genlmsg_free: nlmsg_free(rsp); return err; } void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv) { INIT_LIST_HEAD(&priv->bindings); mutex_init(&priv->lock); } void netdev_nl_sock_priv_destroy(struct netdev_nl_sock *priv) { struct net_devmem_dmabuf_binding *binding; struct net_devmem_dmabuf_binding *temp; netdevice_tracker dev_tracker; struct net_device *dev; mutex_lock(&priv->lock); list_for_each_entry_safe(binding, temp, &priv->bindings, list) { mutex_lock(&binding->lock); dev = binding->dev; if (!dev) { mutex_unlock(&binding->lock); net_devmem_unbind_dmabuf(binding); continue; } netdev_hold(dev, &dev_tracker, GFP_KERNEL); mutex_unlock(&binding->lock); netdev_lock(dev); net_devmem_unbind_dmabuf(binding); netdev_unlock(dev); netdev_put(dev, &dev_tracker); } mutex_unlock(&priv->lock); } static int netdev_genl_netdevice_event(struct notifier_block *nb, unsigned long event, void *ptr) { struct net_device *netdev = netdev_notifier_info_to_dev(ptr); switch (event) { case NETDEV_REGISTER: netdev_lock_ops_to_full(netdev); netdev_genl_dev_notify(netdev, NETDEV_CMD_DEV_ADD_NTF); netdev_unlock_full_to_ops(netdev); break; case NETDEV_UNREGISTER: netdev_lock(netdev); netdev_genl_dev_notify(netdev, NETDEV_CMD_DEV_DEL_NTF); netdev_unlock(netdev); break; case NETDEV_XDP_FEAT_CHANGE: netdev_genl_dev_notify(netdev, NETDEV_CMD_DEV_CHANGE_NTF); break; } return NOTIFY_OK; } static struct notifier_block netdev_genl_nb = { .notifier_call = netdev_genl_netdevice_event, }; static int __init netdev_genl_init(void) { int err; err = register_netdevice_notifier(&netdev_genl_nb); if (err) return err; err = genl_register_family(&netdev_nl_family); if (err) goto err_unreg_ntf; return 0; err_unreg_ntf: unregister_netdevice_notifier(&netdev_genl_nb); return err; } subsys_initcall(netdev_genl_init); |
| 10 8 10 18 2 2 2 2 2 2 2 1 1 1 1 1 2 2 3 3 3 3 3 2 2 1 1 1 2 2 2 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (c) 2016 Mellanox Technologies. All rights reserved. * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com> */ #include "devl_internal.h" struct devlink_sb { struct list_head list; unsigned int index; u32 size; u16 ingress_pools_count; u16 egress_pools_count; u16 ingress_tc_count; u16 egress_tc_count; }; static u16 devlink_sb_pool_count(struct devlink_sb *devlink_sb) { return devlink_sb->ingress_pools_count + devlink_sb->egress_pools_count; } static struct devlink_sb *devlink_sb_get_by_index(struct devlink *devlink, unsigned int sb_index) { struct devlink_sb *devlink_sb; list_for_each_entry(devlink_sb, &devlink->sb_list, list) { if (devlink_sb->index == sb_index) return devlink_sb; } return NULL; } static bool devlink_sb_index_exists(struct devlink *devlink, unsigned int sb_index) { return devlink_sb_get_by_index(devlink, sb_index); } static struct devlink_sb *devlink_sb_get_from_attrs(struct devlink *devlink, struct nlattr **attrs) { if (attrs[DEVLINK_ATTR_SB_INDEX]) { u32 sb_index = nla_get_u32(attrs[DEVLINK_ATTR_SB_INDEX]); struct devlink_sb *devlink_sb; devlink_sb = devlink_sb_get_by_index(devlink, sb_index); if (!devlink_sb) return ERR_PTR(-ENODEV); return devlink_sb; } return ERR_PTR(-EINVAL); } static struct devlink_sb *devlink_sb_get_from_info(struct devlink *devlink, struct genl_info *info) { return devlink_sb_get_from_attrs(devlink, info->attrs); } static int devlink_sb_pool_index_get_from_attrs(struct devlink_sb *devlink_sb, struct nlattr **attrs, u16 *p_pool_index) { u16 val; if (!attrs[DEVLINK_ATTR_SB_POOL_INDEX]) return -EINVAL; val = nla_get_u16(attrs[DEVLINK_ATTR_SB_POOL_INDEX]); if (val >= devlink_sb_pool_count(devlink_sb)) return -EINVAL; *p_pool_index = val; return 0; } static int devlink_sb_pool_index_get_from_info(struct devlink_sb *devlink_sb, struct genl_info *info, u16 *p_pool_index) { return devlink_sb_pool_index_get_from_attrs(devlink_sb, info->attrs, p_pool_index); } static int devlink_sb_pool_type_get_from_attrs(struct nlattr **attrs, enum devlink_sb_pool_type *p_pool_type) { u8 val; if (!attrs[DEVLINK_ATTR_SB_POOL_TYPE]) return -EINVAL; val = nla_get_u8(attrs[DEVLINK_ATTR_SB_POOL_TYPE]); if (val != DEVLINK_SB_POOL_TYPE_INGRESS && val != DEVLINK_SB_POOL_TYPE_EGRESS) return -EINVAL; *p_pool_type = val; return 0; } static int devlink_sb_pool_type_get_from_info(struct genl_info *info, enum devlink_sb_pool_type *p_pool_type) { return devlink_sb_pool_type_get_from_attrs(info->attrs, p_pool_type); } static int devlink_sb_th_type_get_from_attrs(struct nlattr **attrs, enum devlink_sb_threshold_type *p_th_type) { u8 val; if (!attrs[DEVLINK_ATTR_SB_POOL_THRESHOLD_TYPE]) return -EINVAL; val = nla_get_u8(attrs[DEVLINK_ATTR_SB_POOL_THRESHOLD_TYPE]); if (val != DEVLINK_SB_THRESHOLD_TYPE_STATIC && val != DEVLINK_SB_THRESHOLD_TYPE_DYNAMIC) return -EINVAL; *p_th_type = val; return 0; } static int devlink_sb_th_type_get_from_info(struct genl_info *info, enum devlink_sb_threshold_type *p_th_type) { return devlink_sb_th_type_get_from_attrs(info->attrs, p_th_type); } static int devlink_sb_tc_index_get_from_attrs(struct devlink_sb *devlink_sb, struct nlattr **attrs, enum devlink_sb_pool_type pool_type, u16 *p_tc_index) { u16 val; if (!attrs[DEVLINK_ATTR_SB_TC_INDEX]) return -EINVAL; val = nla_get_u16(attrs[DEVLINK_ATTR_SB_TC_INDEX]); if (pool_type == DEVLINK_SB_POOL_TYPE_INGRESS && val >= devlink_sb->ingress_tc_count) return -EINVAL; if (pool_type == DEVLINK_SB_POOL_TYPE_EGRESS && val >= devlink_sb->egress_tc_count) return -EINVAL; *p_tc_index = val; return 0; } static int devlink_sb_tc_index_get_from_info(struct devlink_sb *devlink_sb, struct genl_info *info, enum devlink_sb_pool_type pool_type, u16 *p_tc_index) { return devlink_sb_tc_index_get_from_attrs(devlink_sb, info->attrs, pool_type, p_tc_index); } static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink, struct devlink_sb *devlink_sb, enum devlink_command cmd, u32 portid, u32 seq, int flags) { void *hdr; hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd); if (!hdr) return -EMSGSIZE; if (devlink_nl_put_handle(msg, devlink)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_INDEX, devlink_sb->index)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_SIZE, devlink_sb->size)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_INGRESS_POOL_COUNT, devlink_sb->ingress_pools_count)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_EGRESS_POOL_COUNT, devlink_sb->egress_pools_count)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_INGRESS_TC_COUNT, devlink_sb->ingress_tc_count)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_EGRESS_TC_COUNT, devlink_sb->egress_tc_count)) goto nla_put_failure; genlmsg_end(msg, hdr); return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } int devlink_nl_sb_get_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_sb *devlink_sb; struct sk_buff *msg; int err; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; err = devlink_nl_sb_fill(msg, devlink, devlink_sb, DEVLINK_CMD_SB_NEW, info->snd_portid, info->snd_seq, 0); if (err) { nlmsg_free(msg); return err; } return genlmsg_reply(msg, info); } static int devlink_nl_sb_get_dump_one(struct sk_buff *msg, struct devlink *devlink, struct netlink_callback *cb, int flags) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); struct devlink_sb *devlink_sb; int idx = 0; int err = 0; list_for_each_entry(devlink_sb, &devlink->sb_list, list) { if (idx < state->idx) { idx++; continue; } err = devlink_nl_sb_fill(msg, devlink, devlink_sb, DEVLINK_CMD_SB_NEW, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, flags); if (err) { state->idx = idx; break; } idx++; } return err; } int devlink_nl_sb_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { return devlink_nl_dumpit(skb, cb, devlink_nl_sb_get_dump_one); } static int devlink_nl_sb_pool_fill(struct sk_buff *msg, struct devlink *devlink, struct devlink_sb *devlink_sb, u16 pool_index, enum devlink_command cmd, u32 portid, u32 seq, int flags) { struct devlink_sb_pool_info pool_info; void *hdr; int err; err = devlink->ops->sb_pool_get(devlink, devlink_sb->index, pool_index, &pool_info); if (err) return err; hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd); if (!hdr) return -EMSGSIZE; if (devlink_nl_put_handle(msg, devlink)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_INDEX, devlink_sb->index)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_POOL_INDEX, pool_index)) goto nla_put_failure; if (nla_put_u8(msg, DEVLINK_ATTR_SB_POOL_TYPE, pool_info.pool_type)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_POOL_SIZE, pool_info.size)) goto nla_put_failure; if (nla_put_u8(msg, DEVLINK_ATTR_SB_POOL_THRESHOLD_TYPE, pool_info.threshold_type)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_POOL_CELL_SIZE, pool_info.cell_size)) goto nla_put_failure; genlmsg_end(msg, hdr); return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } int devlink_nl_sb_pool_get_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; struct devlink_sb *devlink_sb; struct sk_buff *msg; u16 pool_index; int err; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); err = devlink_sb_pool_index_get_from_info(devlink_sb, info, &pool_index); if (err) return err; if (!devlink->ops->sb_pool_get) return -EOPNOTSUPP; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; err = devlink_nl_sb_pool_fill(msg, devlink, devlink_sb, pool_index, DEVLINK_CMD_SB_POOL_NEW, info->snd_portid, info->snd_seq, 0); if (err) { nlmsg_free(msg); return err; } return genlmsg_reply(msg, info); } static int __sb_pool_get_dumpit(struct sk_buff *msg, int start, int *p_idx, struct devlink *devlink, struct devlink_sb *devlink_sb, u32 portid, u32 seq, int flags) { u16 pool_count = devlink_sb_pool_count(devlink_sb); u16 pool_index; int err; for (pool_index = 0; pool_index < pool_count; pool_index++) { if (*p_idx < start) { (*p_idx)++; continue; } err = devlink_nl_sb_pool_fill(msg, devlink, devlink_sb, pool_index, DEVLINK_CMD_SB_POOL_NEW, portid, seq, flags); if (err) return err; (*p_idx)++; } return 0; } static int devlink_nl_sb_pool_get_dump_one(struct sk_buff *msg, struct devlink *devlink, struct netlink_callback *cb, int flags) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); struct devlink_sb *devlink_sb; int err = 0; int idx = 0; if (!devlink->ops->sb_pool_get) return 0; list_for_each_entry(devlink_sb, &devlink->sb_list, list) { err = __sb_pool_get_dumpit(msg, state->idx, &idx, devlink, devlink_sb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, flags); if (err == -EOPNOTSUPP) { err = 0; } else if (err) { state->idx = idx; break; } } return err; } int devlink_nl_sb_pool_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { return devlink_nl_dumpit(skb, cb, devlink_nl_sb_pool_get_dump_one); } static int devlink_sb_pool_set(struct devlink *devlink, unsigned int sb_index, u16 pool_index, u32 size, enum devlink_sb_threshold_type threshold_type, struct netlink_ext_ack *extack) { const struct devlink_ops *ops = devlink->ops; if (ops->sb_pool_set) return ops->sb_pool_set(devlink, sb_index, pool_index, size, threshold_type, extack); return -EOPNOTSUPP; } int devlink_nl_sb_pool_set_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; enum devlink_sb_threshold_type threshold_type; struct devlink_sb *devlink_sb; u16 pool_index; u32 size; int err; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); err = devlink_sb_pool_index_get_from_info(devlink_sb, info, &pool_index); if (err) return err; err = devlink_sb_th_type_get_from_info(info, &threshold_type); if (err) return err; if (GENL_REQ_ATTR_CHECK(info, DEVLINK_ATTR_SB_POOL_SIZE)) return -EINVAL; size = nla_get_u32(info->attrs[DEVLINK_ATTR_SB_POOL_SIZE]); return devlink_sb_pool_set(devlink, devlink_sb->index, pool_index, size, threshold_type, info->extack); } static int devlink_nl_sb_port_pool_fill(struct sk_buff *msg, struct devlink *devlink, struct devlink_port *devlink_port, struct devlink_sb *devlink_sb, u16 pool_index, enum devlink_command cmd, u32 portid, u32 seq, int flags) { const struct devlink_ops *ops = devlink->ops; u32 threshold; void *hdr; int err; err = ops->sb_port_pool_get(devlink_port, devlink_sb->index, pool_index, &threshold); if (err) return err; hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd); if (!hdr) return -EMSGSIZE; if (devlink_nl_put_handle(msg, devlink)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_PORT_INDEX, devlink_port->index)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_INDEX, devlink_sb->index)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_POOL_INDEX, pool_index)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_THRESHOLD, threshold)) goto nla_put_failure; if (ops->sb_occ_port_pool_get) { u32 cur; u32 max; err = ops->sb_occ_port_pool_get(devlink_port, devlink_sb->index, pool_index, &cur, &max); if (err && err != -EOPNOTSUPP) goto sb_occ_get_failure; if (!err) { if (nla_put_u32(msg, DEVLINK_ATTR_SB_OCC_CUR, cur)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_OCC_MAX, max)) goto nla_put_failure; } } genlmsg_end(msg, hdr); return 0; nla_put_failure: err = -EMSGSIZE; sb_occ_get_failure: genlmsg_cancel(msg, hdr); return err; } int devlink_nl_sb_port_pool_get_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink_port *devlink_port = info->user_ptr[1]; struct devlink *devlink = devlink_port->devlink; struct devlink_sb *devlink_sb; struct sk_buff *msg; u16 pool_index; int err; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); err = devlink_sb_pool_index_get_from_info(devlink_sb, info, &pool_index); if (err) return err; if (!devlink->ops->sb_port_pool_get) return -EOPNOTSUPP; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; err = devlink_nl_sb_port_pool_fill(msg, devlink, devlink_port, devlink_sb, pool_index, DEVLINK_CMD_SB_PORT_POOL_NEW, info->snd_portid, info->snd_seq, 0); if (err) { nlmsg_free(msg); return err; } return genlmsg_reply(msg, info); } static int __sb_port_pool_get_dumpit(struct sk_buff *msg, int start, int *p_idx, struct devlink *devlink, struct devlink_sb *devlink_sb, u32 portid, u32 seq, int flags) { struct devlink_port *devlink_port; u16 pool_count = devlink_sb_pool_count(devlink_sb); unsigned long port_index; u16 pool_index; int err; xa_for_each(&devlink->ports, port_index, devlink_port) { for (pool_index = 0; pool_index < pool_count; pool_index++) { if (*p_idx < start) { (*p_idx)++; continue; } err = devlink_nl_sb_port_pool_fill(msg, devlink, devlink_port, devlink_sb, pool_index, DEVLINK_CMD_SB_PORT_POOL_NEW, portid, seq, flags); if (err) return err; (*p_idx)++; } } return 0; } static int devlink_nl_sb_port_pool_get_dump_one(struct sk_buff *msg, struct devlink *devlink, struct netlink_callback *cb, int flags) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); struct devlink_sb *devlink_sb; int idx = 0; int err = 0; if (!devlink->ops->sb_port_pool_get) return 0; list_for_each_entry(devlink_sb, &devlink->sb_list, list) { err = __sb_port_pool_get_dumpit(msg, state->idx, &idx, devlink, devlink_sb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, flags); if (err == -EOPNOTSUPP) { err = 0; } else if (err) { state->idx = idx; break; } } return err; } int devlink_nl_sb_port_pool_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { return devlink_nl_dumpit(skb, cb, devlink_nl_sb_port_pool_get_dump_one); } static int devlink_sb_port_pool_set(struct devlink_port *devlink_port, unsigned int sb_index, u16 pool_index, u32 threshold, struct netlink_ext_ack *extack) { const struct devlink_ops *ops = devlink_port->devlink->ops; if (ops->sb_port_pool_set) return ops->sb_port_pool_set(devlink_port, sb_index, pool_index, threshold, extack); return -EOPNOTSUPP; } int devlink_nl_sb_port_pool_set_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink_port *devlink_port = info->user_ptr[1]; struct devlink *devlink = info->user_ptr[0]; struct devlink_sb *devlink_sb; u16 pool_index; u32 threshold; int err; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); err = devlink_sb_pool_index_get_from_info(devlink_sb, info, &pool_index); if (err) return err; if (GENL_REQ_ATTR_CHECK(info, DEVLINK_ATTR_SB_THRESHOLD)) return -EINVAL; threshold = nla_get_u32(info->attrs[DEVLINK_ATTR_SB_THRESHOLD]); return devlink_sb_port_pool_set(devlink_port, devlink_sb->index, pool_index, threshold, info->extack); } static int devlink_nl_sb_tc_pool_bind_fill(struct sk_buff *msg, struct devlink *devlink, struct devlink_port *devlink_port, struct devlink_sb *devlink_sb, u16 tc_index, enum devlink_sb_pool_type pool_type, enum devlink_command cmd, u32 portid, u32 seq, int flags) { const struct devlink_ops *ops = devlink->ops; u16 pool_index; u32 threshold; void *hdr; int err; err = ops->sb_tc_pool_bind_get(devlink_port, devlink_sb->index, tc_index, pool_type, &pool_index, &threshold); if (err) return err; hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd); if (!hdr) return -EMSGSIZE; if (devlink_nl_put_handle(msg, devlink)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_PORT_INDEX, devlink_port->index)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_INDEX, devlink_sb->index)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_TC_INDEX, tc_index)) goto nla_put_failure; if (nla_put_u8(msg, DEVLINK_ATTR_SB_POOL_TYPE, pool_type)) goto nla_put_failure; if (nla_put_u16(msg, DEVLINK_ATTR_SB_POOL_INDEX, pool_index)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_THRESHOLD, threshold)) goto nla_put_failure; if (ops->sb_occ_tc_port_bind_get) { u32 cur; u32 max; err = ops->sb_occ_tc_port_bind_get(devlink_port, devlink_sb->index, tc_index, pool_type, &cur, &max); if (err && err != -EOPNOTSUPP) return err; if (!err) { if (nla_put_u32(msg, DEVLINK_ATTR_SB_OCC_CUR, cur)) goto nla_put_failure; if (nla_put_u32(msg, DEVLINK_ATTR_SB_OCC_MAX, max)) goto nla_put_failure; } } genlmsg_end(msg, hdr); return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } int devlink_nl_sb_tc_pool_bind_get_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink_port *devlink_port = info->user_ptr[1]; struct devlink *devlink = devlink_port->devlink; struct devlink_sb *devlink_sb; struct sk_buff *msg; enum devlink_sb_pool_type pool_type; u16 tc_index; int err; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); err = devlink_sb_pool_type_get_from_info(info, &pool_type); if (err) return err; err = devlink_sb_tc_index_get_from_info(devlink_sb, info, pool_type, &tc_index); if (err) return err; if (!devlink->ops->sb_tc_pool_bind_get) return -EOPNOTSUPP; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; err = devlink_nl_sb_tc_pool_bind_fill(msg, devlink, devlink_port, devlink_sb, tc_index, pool_type, DEVLINK_CMD_SB_TC_POOL_BIND_NEW, info->snd_portid, info->snd_seq, 0); if (err) { nlmsg_free(msg); return err; } return genlmsg_reply(msg, info); } static int __sb_tc_pool_bind_get_dumpit(struct sk_buff *msg, int start, int *p_idx, struct devlink *devlink, struct devlink_sb *devlink_sb, u32 portid, u32 seq, int flags) { struct devlink_port *devlink_port; unsigned long port_index; u16 tc_index; int err; xa_for_each(&devlink->ports, port_index, devlink_port) { for (tc_index = 0; tc_index < devlink_sb->ingress_tc_count; tc_index++) { if (*p_idx < start) { (*p_idx)++; continue; } err = devlink_nl_sb_tc_pool_bind_fill(msg, devlink, devlink_port, devlink_sb, tc_index, DEVLINK_SB_POOL_TYPE_INGRESS, DEVLINK_CMD_SB_TC_POOL_BIND_NEW, portid, seq, flags); if (err) return err; (*p_idx)++; } for (tc_index = 0; tc_index < devlink_sb->egress_tc_count; tc_index++) { if (*p_idx < start) { (*p_idx)++; continue; } err = devlink_nl_sb_tc_pool_bind_fill(msg, devlink, devlink_port, devlink_sb, tc_index, DEVLINK_SB_POOL_TYPE_EGRESS, DEVLINK_CMD_SB_TC_POOL_BIND_NEW, portid, seq, flags); if (err) return err; (*p_idx)++; } } return 0; } static int devlink_nl_sb_tc_pool_bind_get_dump_one(struct sk_buff *msg, struct devlink *devlink, struct netlink_callback *cb, int flags) { struct devlink_nl_dump_state *state = devlink_dump_state(cb); struct devlink_sb *devlink_sb; int idx = 0; int err = 0; if (!devlink->ops->sb_tc_pool_bind_get) return 0; list_for_each_entry(devlink_sb, &devlink->sb_list, list) { err = __sb_tc_pool_bind_get_dumpit(msg, state->idx, &idx, devlink, devlink_sb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, flags); if (err == -EOPNOTSUPP) { err = 0; } else if (err) { state->idx = idx; break; } } return err; } int devlink_nl_sb_tc_pool_bind_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) { return devlink_nl_dumpit(skb, cb, devlink_nl_sb_tc_pool_bind_get_dump_one); } static int devlink_sb_tc_pool_bind_set(struct devlink_port *devlink_port, unsigned int sb_index, u16 tc_index, enum devlink_sb_pool_type pool_type, u16 pool_index, u32 threshold, struct netlink_ext_ack *extack) { const struct devlink_ops *ops = devlink_port->devlink->ops; if (ops->sb_tc_pool_bind_set) return ops->sb_tc_pool_bind_set(devlink_port, sb_index, tc_index, pool_type, pool_index, threshold, extack); return -EOPNOTSUPP; } int devlink_nl_sb_tc_pool_bind_set_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink_port *devlink_port = info->user_ptr[1]; struct devlink *devlink = info->user_ptr[0]; enum devlink_sb_pool_type pool_type; struct devlink_sb *devlink_sb; u16 tc_index; u16 pool_index; u32 threshold; int err; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); err = devlink_sb_pool_type_get_from_info(info, &pool_type); if (err) return err; err = devlink_sb_tc_index_get_from_info(devlink_sb, info, pool_type, &tc_index); if (err) return err; err = devlink_sb_pool_index_get_from_info(devlink_sb, info, &pool_index); if (err) return err; if (GENL_REQ_ATTR_CHECK(info, DEVLINK_ATTR_SB_THRESHOLD)) return -EINVAL; threshold = nla_get_u32(info->attrs[DEVLINK_ATTR_SB_THRESHOLD]); return devlink_sb_tc_pool_bind_set(devlink_port, devlink_sb->index, tc_index, pool_type, pool_index, threshold, info->extack); } int devlink_nl_sb_occ_snapshot_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; const struct devlink_ops *ops = devlink->ops; struct devlink_sb *devlink_sb; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); if (ops->sb_occ_snapshot) return ops->sb_occ_snapshot(devlink, devlink_sb->index); return -EOPNOTSUPP; } int devlink_nl_sb_occ_max_clear_doit(struct sk_buff *skb, struct genl_info *info) { struct devlink *devlink = info->user_ptr[0]; const struct devlink_ops *ops = devlink->ops; struct devlink_sb *devlink_sb; devlink_sb = devlink_sb_get_from_info(devlink, info); if (IS_ERR(devlink_sb)) return PTR_ERR(devlink_sb); if (ops->sb_occ_max_clear) return ops->sb_occ_max_clear(devlink, devlink_sb->index); return -EOPNOTSUPP; } int devl_sb_register(struct devlink *devlink, unsigned int sb_index, u32 size, u16 ingress_pools_count, u16 egress_pools_count, u16 ingress_tc_count, u16 egress_tc_count) { struct devlink_sb *devlink_sb; lockdep_assert_held(&devlink->lock); if (devlink_sb_index_exists(devlink, sb_index)) return -EEXIST; devlink_sb = kzalloc(sizeof(*devlink_sb), GFP_KERNEL); if (!devlink_sb) return -ENOMEM; devlink_sb->index = sb_index; devlink_sb->size = size; devlink_sb->ingress_pools_count = ingress_pools_count; devlink_sb->egress_pools_count = egress_pools_count; devlink_sb->ingress_tc_count = ingress_tc_count; devlink_sb->egress_tc_count = egress_tc_count; list_add_tail(&devlink_sb->list, &devlink->sb_list); return 0; } EXPORT_SYMBOL_GPL(devl_sb_register); int devlink_sb_register(struct devlink *devlink, unsigned int sb_index, u32 size, u16 ingress_pools_count, u16 egress_pools_count, u16 ingress_tc_count, u16 egress_tc_count) { int err; devl_lock(devlink); err = devl_sb_register(devlink, sb_index, size, ingress_pools_count, egress_pools_count, ingress_tc_count, egress_tc_count); devl_unlock(devlink); return err; } EXPORT_SYMBOL_GPL(devlink_sb_register); void devl_sb_unregister(struct devlink *devlink, unsigned int sb_index) { struct devlink_sb *devlink_sb; lockdep_assert_held(&devlink->lock); devlink_sb = devlink_sb_get_by_index(devlink, sb_index); WARN_ON(!devlink_sb); list_del(&devlink_sb->list); kfree(devlink_sb); } EXPORT_SYMBOL_GPL(devl_sb_unregister); void devlink_sb_unregister(struct devlink *devlink, unsigned int sb_index) { devl_lock(devlink); devl_sb_unregister(devlink, sb_index); devl_unlock(devlink); } EXPORT_SYMBOL_GPL(devlink_sb_unregister); |
| 12 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_IOBITMAP_H #define _ASM_X86_IOBITMAP_H #include <linux/refcount.h> #include <asm/processor.h> struct io_bitmap { u64 sequence; refcount_t refcnt; /* The maximum number of bytes to copy so all zero bits are covered */ unsigned int max; unsigned long bitmap[IO_BITMAP_LONGS]; }; struct task_struct; #ifdef CONFIG_X86_IOPL_IOPERM void io_bitmap_share(struct task_struct *tsk); void io_bitmap_exit(struct task_struct *tsk); static inline void native_tss_invalidate_io_bitmap(void) { /* * Invalidate the I/O bitmap by moving io_bitmap_base outside the * TSS limit so any subsequent I/O access from user space will * trigger a #GP. * * This is correct even when VMEXIT rewrites the TSS limit * to 0x67 as the only requirement is that the base points * outside the limit. */ this_cpu_write(cpu_tss_rw.x86_tss.io_bitmap_base, IO_BITMAP_OFFSET_INVALID); } void native_tss_update_io_bitmap(void); #ifdef CONFIG_PARAVIRT_XXL #include <asm/paravirt.h> #else #define tss_update_io_bitmap native_tss_update_io_bitmap #define tss_invalidate_io_bitmap native_tss_invalidate_io_bitmap #endif #else static inline void io_bitmap_share(struct task_struct *tsk) { } static inline void io_bitmap_exit(struct task_struct *tsk) { } static inline void tss_update_io_bitmap(void) { } #endif #endif |
| 3375 3412 3377 3381 36 36 3 1 1 1 1 7 21 29 28 28 1 1 12 8 8 47 4 44 44 12 12 1 12 12 15 1 10 1 4 12 12 12 12 12 12 12 2 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 | // SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ #include <linux/bpf.h> #include <linux/btf.h> #include <linux/err.h> #include "linux/filter.h" #include <linux/btf_ids.h> #include <linux/vmalloc.h> #include <linux/pagemap.h> #include "range_tree.h" /* * bpf_arena is a sparsely populated shared memory region between bpf program and * user space process. * * For example on x86-64 the values could be: * user_vm_start 7f7d26200000 // picked by mmap() * kern_vm_start ffffc90001e69000 // picked by get_vm_area() * For user space all pointers within the arena are normal 8-byte addresses. * In this example 7f7d26200000 is the address of the first page (pgoff=0). * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr * (u32)7f7d26200000 -> 26200000 * hence * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb * kernel memory region. * * BPF JITs generate the following code to access arena: * mov eax, eax // eax has lower 32-bit of user pointer * mov word ptr [rax + r12 + off], bx * where r12 == kern_vm_start and off is s16. * Hence allocate 4Gb + GUARD_SZ/2 on each side. * * Initially kernel vm_area and user vma are not populated. * User space can fault-in any address which will insert the page * into kernel and user vma. * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc * which will insert it into kernel vm_area. * The later fault-in from user space will populate that page into user vma. */ /* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */ #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1) #define KERN_VM_SZ (SZ_4G + GUARD_SZ) struct bpf_arena { struct bpf_map map; u64 user_vm_start; u64 user_vm_end; struct vm_struct *kern_vm; struct range_tree rt; struct list_head vma_list; struct mutex lock; }; u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena) { return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0; } u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena) { return arena ? arena->user_vm_start : 0; } static long arena_map_peek_elem(struct bpf_map *map, void *value) { return -EOPNOTSUPP; } static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags) { return -EOPNOTSUPP; } static long arena_map_pop_elem(struct bpf_map *map, void *value) { return -EOPNOTSUPP; } static long arena_map_delete_elem(struct bpf_map *map, void *value) { return -EOPNOTSUPP; } static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key) { return -EOPNOTSUPP; } static long compute_pgoff(struct bpf_arena *arena, long uaddr) { return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT; } static struct bpf_map *arena_map_alloc(union bpf_attr *attr) { struct vm_struct *kern_vm; int numa_node = bpf_map_attr_numa_node(attr); struct bpf_arena *arena; u64 vm_range; int err = -ENOMEM; if (!bpf_jit_supports_arena()) return ERR_PTR(-EOPNOTSUPP); if (attr->key_size || attr->value_size || attr->max_entries == 0 || /* BPF_F_MMAPABLE must be set */ !(attr->map_flags & BPF_F_MMAPABLE) || /* No unsupported flags present */ (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV))) return ERR_PTR(-EINVAL); if (attr->map_extra & ~PAGE_MASK) /* If non-zero the map_extra is an expected user VMA start address */ return ERR_PTR(-EINVAL); vm_range = (u64)attr->max_entries * PAGE_SIZE; if (vm_range > SZ_4G) return ERR_PTR(-E2BIG); if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32)) /* user vma must not cross 32-bit boundary */ return ERR_PTR(-ERANGE); kern_vm = get_vm_area(KERN_VM_SZ, VM_SPARSE | VM_USERMAP); if (!kern_vm) return ERR_PTR(-ENOMEM); arena = bpf_map_area_alloc(sizeof(*arena), numa_node); if (!arena) goto err; arena->kern_vm = kern_vm; arena->user_vm_start = attr->map_extra; if (arena->user_vm_start) arena->user_vm_end = arena->user_vm_start + vm_range; INIT_LIST_HEAD(&arena->vma_list); bpf_map_init_from_attr(&arena->map, attr); range_tree_init(&arena->rt); err = range_tree_set(&arena->rt, 0, attr->max_entries); if (err) { bpf_map_area_free(arena); goto err; } mutex_init(&arena->lock); return &arena->map; err: free_vm_area(kern_vm); return ERR_PTR(err); } static int existing_page_cb(pte_t *ptep, unsigned long addr, void *data) { struct page *page; pte_t pte; pte = ptep_get(ptep); if (!pte_present(pte)) /* sanity check */ return 0; page = pte_page(pte); /* * We do not update pte here: * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug * 2. TLB flushing is batched or deferred. Even if we clear pte, * the TLB entries can stick around and continue to permit access to * the freed page. So it all relies on 1. */ __free_page(page); return 0; } static void arena_map_free(struct bpf_map *map) { struct bpf_arena *arena = container_of(map, struct bpf_arena, map); /* * Check that user vma-s are not around when bpf map is freed. * mmap() holds vm_file which holds bpf_map refcnt. * munmap() must have happened on vma followed by arena_vm_close() * which would clear arena->vma_list. */ if (WARN_ON_ONCE(!list_empty(&arena->vma_list))) return; /* * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area(). * It unmaps everything from vmalloc area and clears pgtables. * Call apply_to_existing_page_range() first to find populated ptes and * free those pages. */ apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena), KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL); free_vm_area(arena->kern_vm); range_tree_destroy(&arena->rt); bpf_map_area_free(arena); } static void *arena_map_lookup_elem(struct bpf_map *map, void *key) { return ERR_PTR(-EINVAL); } static long arena_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags) { return -EOPNOTSUPP; } static int arena_map_check_btf(const struct bpf_map *map, const struct btf *btf, const struct btf_type *key_type, const struct btf_type *value_type) { return 0; } static u64 arena_map_mem_usage(const struct bpf_map *map) { return 0; } struct vma_list { struct vm_area_struct *vma; struct list_head head; refcount_t mmap_count; }; static int remember_vma(struct bpf_arena *arena, struct vm_area_struct *vma) { struct vma_list *vml; vml = kmalloc(sizeof(*vml), GFP_KERNEL); if (!vml) return -ENOMEM; refcount_set(&vml->mmap_count, 1); vma->vm_private_data = vml; vml->vma = vma; list_add(&vml->head, &arena->vma_list); return 0; } static void arena_vm_open(struct vm_area_struct *vma) { struct vma_list *vml = vma->vm_private_data; refcount_inc(&vml->mmap_count); } static void arena_vm_close(struct vm_area_struct *vma) { struct bpf_map *map = vma->vm_file->private_data; struct bpf_arena *arena = container_of(map, struct bpf_arena, map); struct vma_list *vml = vma->vm_private_data; if (!refcount_dec_and_test(&vml->mmap_count)) return; guard(mutex)(&arena->lock); /* update link list under lock */ list_del(&vml->head); vma->vm_private_data = NULL; kfree(vml); } static vm_fault_t arena_vm_fault(struct vm_fault *vmf) { struct bpf_map *map = vmf->vma->vm_file->private_data; struct bpf_arena *arena = container_of(map, struct bpf_arena, map); struct page *page; long kbase, kaddr; int ret; kbase = bpf_arena_get_kern_vm_start(arena); kaddr = kbase + (u32)(vmf->address); guard(mutex)(&arena->lock); page = vmalloc_to_page((void *)kaddr); if (page) /* already have a page vmap-ed */ goto out; if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) /* User space requested to segfault when page is not allocated by bpf prog */ return VM_FAULT_SIGSEGV; ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); if (ret) return VM_FAULT_SIGSEGV; /* Account into memcg of the process that created bpf_arena */ ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); if (ret) { range_tree_set(&arena->rt, vmf->pgoff, 1); return VM_FAULT_SIGSEGV; } ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page); if (ret) { range_tree_set(&arena->rt, vmf->pgoff, 1); __free_page(page); return VM_FAULT_SIGSEGV; } out: page_ref_add(page, 1); vmf->page = page; return 0; } static const struct vm_operations_struct arena_vm_ops = { .open = arena_vm_open, .close = arena_vm_close, .fault = arena_vm_fault, }; static unsigned long arena_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { struct bpf_map *map = filp->private_data; struct bpf_arena *arena = container_of(map, struct bpf_arena, map); long ret; if (pgoff) return -EINVAL; if (len > SZ_4G) return -E2BIG; /* if user_vm_start was specified at arena creation time */ if (arena->user_vm_start) { if (len > arena->user_vm_end - arena->user_vm_start) return -E2BIG; if (len != arena->user_vm_end - arena->user_vm_start) return -EINVAL; if (addr != arena->user_vm_start) return -EINVAL; } ret = mm_get_unmapped_area(current->mm, filp, addr, len * 2, 0, flags); if (IS_ERR_VALUE(ret)) return ret; if ((ret >> 32) == ((ret + len - 1) >> 32)) return ret; if (WARN_ON_ONCE(arena->user_vm_start)) /* checks at map creation time should prevent this */ return -EFAULT; return round_up(ret, SZ_4G); } static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) { struct bpf_arena *arena = container_of(map, struct bpf_arena, map); guard(mutex)(&arena->lock); if (arena->user_vm_start && arena->user_vm_start != vma->vm_start) /* * If map_extra was not specified at arena creation time then * 1st user process can do mmap(NULL, ...) to pick user_vm_start * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..); * or * specify addr in map_extra and * use the same addr later with mmap(addr, MAP_FIXED..); */ return -EBUSY; if (arena->user_vm_end && arena->user_vm_end != vma->vm_end) /* all user processes must have the same size of mmap-ed region */ return -EBUSY; /* Earlier checks should prevent this */ if (WARN_ON_ONCE(vma->vm_end - vma->vm_start > SZ_4G || vma->vm_pgoff)) return -EFAULT; if (remember_vma(arena, vma)) return -ENOMEM; arena->user_vm_start = vma->vm_start; arena->user_vm_end = vma->vm_end; /* * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid * potential change of user_vm_start. */ vm_flags_set(vma, VM_DONTEXPAND); vma->vm_ops = &arena_vm_ops; return 0; } static int arena_map_direct_value_addr(const struct bpf_map *map, u64 *imm, u32 off) { struct bpf_arena *arena = container_of(map, struct bpf_arena, map); if ((u64)off > arena->user_vm_end - arena->user_vm_start) return -ERANGE; *imm = (unsigned long)arena->user_vm_start; return 0; } BTF_ID_LIST_SINGLE(bpf_arena_map_btf_ids, struct, bpf_arena) const struct bpf_map_ops arena_map_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc = arena_map_alloc, .map_free = arena_map_free, .map_direct_value_addr = arena_map_direct_value_addr, .map_mmap = arena_map_mmap, .map_get_unmapped_area = arena_get_unmapped_area, .map_get_next_key = arena_map_get_next_key, .map_push_elem = arena_map_push_elem, .map_peek_elem = arena_map_peek_elem, .map_pop_elem = arena_map_pop_elem, .map_lookup_elem = arena_map_lookup_elem, .map_update_elem = arena_map_update_elem, .map_delete_elem = arena_map_delete_elem, .map_check_btf = arena_map_check_btf, .map_mem_usage = arena_map_mem_usage, .map_btf_id = &bpf_arena_map_btf_ids[0], }; static u64 clear_lo32(u64 val) { return val & ~(u64)~0U; } /* * Allocate pages and vmap them into kernel vmalloc area. * Later the pages will be mmaped into user space vma. */ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id) { /* user_vm_end/start are fixed before bpf prog runs */ long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT; u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena); struct page **pages; long pgoff = 0; u32 uaddr32; int ret, i; if (page_cnt > page_cnt_max) return 0; if (uaddr) { if (uaddr & ~PAGE_MASK) return 0; pgoff = compute_pgoff(arena, uaddr); if (pgoff > page_cnt_max - page_cnt) /* requested address will be outside of user VMA */ return 0; } /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */ pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL); if (!pages) return 0; guard(mutex)(&arena->lock); if (uaddr) { ret = is_range_tree_set(&arena->rt, pgoff, page_cnt); if (ret) goto out_free_pages; ret = range_tree_clear(&arena->rt, pgoff, page_cnt); } else { ret = pgoff = range_tree_find(&arena->rt, page_cnt); if (pgoff >= 0) ret = range_tree_clear(&arena->rt, pgoff, page_cnt); } if (ret) goto out_free_pages; ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages); if (ret) goto out; uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE); /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1 * will not overflow 32-bit. Lower 32-bit need to represent * contiguous user address range. * Map these pages at kern_vm_start base. * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow * lower 32-bit and it's ok. */ ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32, kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages); if (ret) { for (i = 0; i < page_cnt; i++) __free_page(pages[i]); goto out; } kvfree(pages); return clear_lo32(arena->user_vm_start) + uaddr32; out: range_tree_set(&arena->rt, pgoff, page_cnt); out_free_pages: kvfree(pages); return 0; } /* * If page is present in vmalloc area, unmap it from vmalloc area, * unmap it from all user space vma-s, * and free it. */ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt) { struct vma_list *vml; list_for_each_entry(vml, &arena->vma_list, head) zap_page_range_single(vml->vma, uaddr, PAGE_SIZE * page_cnt, NULL); } static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt) { u64 full_uaddr, uaddr_end; long kaddr, pgoff, i; struct page *page; /* only aligned lower 32-bit are relevant */ uaddr = (u32)uaddr; uaddr &= PAGE_MASK; full_uaddr = clear_lo32(arena->user_vm_start) + uaddr; uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT)); if (full_uaddr >= uaddr_end) return; page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT; guard(mutex)(&arena->lock); pgoff = compute_pgoff(arena, uaddr); /* clear range */ range_tree_set(&arena->rt, pgoff, page_cnt); if (page_cnt > 1) /* bulk zap if multiple pages being freed */ zap_pages(arena, full_uaddr, page_cnt); kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) { page = vmalloc_to_page((void *)kaddr); if (!page) continue; if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */ /* Optimization for the common case of page_cnt==1: * If page wasn't mapped into some user vma there * is no need to call zap_pages which is slow. When * page_cnt is big it's faster to do the batched zap. */ zap_pages(arena, full_uaddr, 1); vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE); __free_page(page); } } /* * Reserve an arena virtual address range without populating it. This call stops * bpf_arena_alloc_pages from adding pages to this range. */ static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt) { long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT; long pgoff; int ret; if (uaddr & ~PAGE_MASK) return 0; pgoff = compute_pgoff(arena, uaddr); if (pgoff + page_cnt > page_cnt_max) return -EINVAL; guard(mutex)(&arena->lock); /* Cannot guard already allocated pages. */ ret = is_range_tree_set(&arena->rt, pgoff, page_cnt); if (ret) return -EBUSY; /* "Allocate" the region to prevent it from being allocated. */ return range_tree_clear(&arena->rt, pgoff, page_cnt); } __bpf_kfunc_start_defs(); __bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt, int node_id, u64 flags) { struct bpf_map *map = p__map; struct bpf_arena *arena = container_of(map, struct bpf_arena, map); if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt) return NULL; return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id); } __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt) { struct bpf_map *map = p__map; struct bpf_arena *arena = container_of(map, struct bpf_arena, map); if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign) return; arena_free_pages(arena, (long)ptr__ign, page_cnt); } __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt) { struct bpf_map *map = p__map; struct bpf_arena *arena = container_of(map, struct bpf_arena, map); if (map->map_type != BPF_MAP_TYPE_ARENA) return -EINVAL; if (!page_cnt) return 0; return arena_reserve_pages(arena, (long)ptr__ign, page_cnt); } __bpf_kfunc_end_defs(); BTF_KFUNCS_START(arena_kfuncs) BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2) BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) BTF_KFUNCS_END(arena_kfuncs) static const struct btf_kfunc_id_set common_kfunc_set = { .owner = THIS_MODULE, .set = &arena_kfuncs, }; static int __init kfunc_init(void) { return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &common_kfunc_set); } late_initcall(kfunc_init); |
| 6 6 6 6 6 6 6 6 6 6 6 6 6 485 486 4 6 6 6 6 7 7 7 6 1 9 9 9 4 4 4 6 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 | // SPDX-License-Identifier: GPL-2.0-only /* * net/sunrpc/rpc_pipe.c * * Userland/kernel interface for rpcauth_gss. * Code shamelessly plagiarized from fs/nfsd/nfsctl.c * and fs/sysfs/inode.c * * Copyright (c) 2002, Trond Myklebust <trond.myklebust@fys.uio.no> * */ #include <linux/module.h> #include <linux/slab.h> #include <linux/string.h> #include <linux/pagemap.h> #include <linux/mount.h> #include <linux/fs_context.h> #include <linux/namei.h> #include <linux/fsnotify.h> #include <linux/kernel.h> #include <linux/rcupdate.h> #include <linux/utsname.h> #include <asm/ioctls.h> #include <linux/poll.h> #include <linux/wait.h> #include <linux/seq_file.h> #include <linux/sunrpc/clnt.h> #include <linux/workqueue.h> #include <linux/sunrpc/rpc_pipe_fs.h> #include <linux/sunrpc/cache.h> #include <linux/nsproxy.h> #include <linux/notifier.h> #include "netns.h" #include "sunrpc.h" #define RPCDBG_FACILITY RPCDBG_DEBUG #define NET_NAME(net) ((net == &init_net) ? " (init_net)" : "") static struct file_system_type rpc_pipe_fs_type; static const struct rpc_pipe_ops gssd_dummy_pipe_ops; static struct kmem_cache *rpc_inode_cachep __read_mostly; #define RPC_UPCALL_TIMEOUT (30*HZ) static BLOCKING_NOTIFIER_HEAD(rpc_pipefs_notifier_list); int rpc_pipefs_notifier_register(struct notifier_block *nb) { return blocking_notifier_chain_register(&rpc_pipefs_notifier_list, nb); } EXPORT_SYMBOL_GPL(rpc_pipefs_notifier_register); void rpc_pipefs_notifier_unregister(struct notifier_block *nb) { blocking_notifier_chain_unregister(&rpc_pipefs_notifier_list, nb); } EXPORT_SYMBOL_GPL(rpc_pipefs_notifier_unregister); static void rpc_purge_list(wait_queue_head_t *waitq, struct list_head *head, void (*destroy_msg)(struct rpc_pipe_msg *), int err) { struct rpc_pipe_msg *msg; if (list_empty(head)) return; do { msg = list_entry(head->next, struct rpc_pipe_msg, list); list_del_init(&msg->list); msg->errno = err; destroy_msg(msg); } while (!list_empty(head)); if (waitq) wake_up(waitq); } static void rpc_timeout_upcall_queue(struct work_struct *work) { LIST_HEAD(free_list); struct rpc_pipe *pipe = container_of(work, struct rpc_pipe, queue_timeout.work); void (*destroy_msg)(struct rpc_pipe_msg *); struct dentry *dentry; spin_lock(&pipe->lock); destroy_msg = pipe->ops->destroy_msg; if (pipe->nreaders == 0) { list_splice_init(&pipe->pipe, &free_list); pipe->pipelen = 0; } dentry = dget(pipe->dentry); spin_unlock(&pipe->lock); rpc_purge_list(dentry ? &RPC_I(d_inode(dentry))->waitq : NULL, &free_list, destroy_msg, -ETIMEDOUT); dput(dentry); } ssize_t rpc_pipe_generic_upcall(struct file *filp, struct rpc_pipe_msg *msg, char __user *dst, size_t buflen) { char *data = (char *)msg->data + msg->copied; size_t mlen = min(msg->len - msg->copied, buflen); unsigned long left; left = copy_to_user(dst, data, mlen); if (left == mlen) { msg->errno = -EFAULT; return -EFAULT; } mlen -= left; msg->copied += mlen; msg->errno = 0; return mlen; } EXPORT_SYMBOL_GPL(rpc_pipe_generic_upcall); /** * rpc_queue_upcall - queue an upcall message to userspace * @pipe: upcall pipe on which to queue given message * @msg: message to queue * * Call with an @inode created by rpc_mkpipe() to queue an upcall. * A userspace process may then later read the upcall by performing a * read on an open file for this inode. It is up to the caller to * initialize the fields of @msg (other than @msg->list) appropriately. */ int rpc_queue_upcall(struct rpc_pipe *pipe, struct rpc_pipe_msg *msg) { int res = -EPIPE; struct dentry *dentry; spin_lock(&pipe->lock); if (pipe->nreaders) { list_add_tail(&msg->list, &pipe->pipe); pipe->pipelen += msg->len; res = 0; } else if (pipe->flags & RPC_PIPE_WAIT_FOR_OPEN) { if (list_empty(&pipe->pipe)) queue_delayed_work(rpciod_workqueue, &pipe->queue_timeout, RPC_UPCALL_TIMEOUT); list_add_tail(&msg->list, &pipe->pipe); pipe->pipelen += msg->len; res = 0; } dentry = dget(pipe->dentry); spin_unlock(&pipe->lock); if (dentry) { wake_up(&RPC_I(d_inode(dentry))->waitq); dput(dentry); } return res; } EXPORT_SYMBOL_GPL(rpc_queue_upcall); static inline void rpc_inode_setowner(struct inode *inode, void *private) { RPC_I(inode)->private = private; } static void rpc_close_pipes(struct dentry *dentry) { struct inode *inode = dentry->d_inode; struct rpc_pipe *pipe = RPC_I(inode)->pipe; int need_release; LIST_HEAD(free_list); inode_lock(inode); spin_lock(&pipe->lock); need_release = pipe->nreaders != 0 || pipe->nwriters != 0; pipe->nreaders = 0; list_splice_init(&pipe->in_upcall, &free_list); list_splice_init(&pipe->pipe, &free_list); pipe->pipelen = 0; pipe->dentry = NULL; spin_unlock(&pipe->lock); rpc_purge_list(&RPC_I(inode)->waitq, &free_list, pipe->ops->destroy_msg, -EPIPE); pipe->nwriters = 0; if (need_release && pipe->ops->release_pipe) pipe->ops->release_pipe(inode); cancel_delayed_work_sync(&pipe->queue_timeout); rpc_inode_setowner(inode, NULL); RPC_I(inode)->pipe = NULL; inode_unlock(inode); } static struct inode * rpc_alloc_inode(struct super_block *sb) { struct rpc_inode *rpci; rpci = alloc_inode_sb(sb, rpc_inode_cachep, GFP_KERNEL); if (!rpci) return NULL; return &rpci->vfs_inode; } static void rpc_free_inode(struct inode *inode) { kmem_cache_free(rpc_inode_cachep, RPC_I(inode)); } static int rpc_pipe_open(struct inode *inode, struct file *filp) { struct rpc_pipe *pipe; int first_open; int res = -ENXIO; inode_lock(inode); pipe = RPC_I(inode)->pipe; if (pipe == NULL) goto out; first_open = pipe->nreaders == 0 && pipe->nwriters == 0; if (first_open && pipe->ops->open_pipe) { res = pipe->ops->open_pipe(inode); if (res) goto out; } if (filp->f_mode & FMODE_READ) pipe->nreaders++; if (filp->f_mode & FMODE_WRITE) pipe->nwriters++; res = 0; out: inode_unlock(inode); return res; } static int rpc_pipe_release(struct inode *inode, struct file *filp) { struct rpc_pipe *pipe; struct rpc_pipe_msg *msg; int last_close; inode_lock(inode); pipe = RPC_I(inode)->pipe; if (pipe == NULL) goto out; msg = filp->private_data; if (msg != NULL) { spin_lock(&pipe->lock); msg->errno = -EAGAIN; list_del_init(&msg->list); spin_unlock(&pipe->lock); pipe->ops->destroy_msg(msg); } if (filp->f_mode & FMODE_WRITE) pipe->nwriters --; if (filp->f_mode & FMODE_READ) { pipe->nreaders --; if (pipe->nreaders == 0) { LIST_HEAD(free_list); spin_lock(&pipe->lock); list_splice_init(&pipe->pipe, &free_list); pipe->pipelen = 0; spin_unlock(&pipe->lock); rpc_purge_list(&RPC_I(inode)->waitq, &free_list, pipe->ops->destroy_msg, -EAGAIN); } } last_close = pipe->nwriters == 0 && pipe->nreaders == 0; if (last_close && pipe->ops->release_pipe) pipe->ops->release_pipe(inode); out: inode_unlock(inode); return 0; } static ssize_t rpc_pipe_read(struct file *filp, char __user *buf, size_t len, loff_t *offset) { struct inode *inode = file_inode(filp); struct rpc_pipe *pipe; struct rpc_pipe_msg *msg; int res = 0; inode_lock(inode); pipe = RPC_I(inode)->pipe; if (pipe == NULL) { res = -EPIPE; goto out_unlock; } msg = filp->private_data; if (msg == NULL) { spin_lock(&pipe->lock); if (!list_empty(&pipe->pipe)) { msg = list_entry(pipe->pipe.next, struct rpc_pipe_msg, list); list_move(&msg->list, &pipe->in_upcall); pipe->pipelen -= msg->len; filp->private_data = msg; msg->copied = 0; } spin_unlock(&pipe->lock); if (msg == NULL) goto out_unlock; } /* NOTE: it is up to the callback to update msg->copied */ res = pipe->ops->upcall(filp, msg, buf, len); if (res < 0 || msg->len == msg->copied) { filp->private_data = NULL; spin_lock(&pipe->lock); list_del_init(&msg->list); spin_unlock(&pipe->lock); pipe->ops->destroy_msg(msg); } out_unlock: inode_unlock(inode); return res; } static ssize_t rpc_pipe_write(struct file *filp, const char __user *buf, size_t len, loff_t *offset) { struct inode *inode = file_inode(filp); int res; inode_lock(inode); res = -EPIPE; if (RPC_I(inode)->pipe != NULL) res = RPC_I(inode)->pipe->ops->downcall(filp, buf, len); inode_unlock(inode); return res; } static __poll_t rpc_pipe_poll(struct file *filp, struct poll_table_struct *wait) { struct inode *inode = file_inode(filp); struct rpc_inode *rpci = RPC_I(inode); __poll_t mask = EPOLLOUT | EPOLLWRNORM; poll_wait(filp, &rpci->waitq, wait); inode_lock(inode); if (rpci->pipe == NULL) mask |= EPOLLERR | EPOLLHUP; else if (filp->private_data || !list_empty(&rpci->pipe->pipe)) mask |= EPOLLIN | EPOLLRDNORM; inode_unlock(inode); return mask; } static long rpc_pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct inode *inode = file_inode(filp); struct rpc_pipe *pipe; int len; switch (cmd) { case FIONREAD: inode_lock(inode); pipe = RPC_I(inode)->pipe; if (pipe == NULL) { inode_unlock(inode); return -EPIPE; } spin_lock(&pipe->lock); len = pipe->pipelen; if (filp->private_data) { struct rpc_pipe_msg *msg; msg = filp->private_data; len += msg->len - msg->copied; } spin_unlock(&pipe->lock); inode_unlock(inode); return put_user(len, (int __user *)arg); default: return -EINVAL; } } static const struct file_operations rpc_pipe_fops = { .owner = THIS_MODULE, .read = rpc_pipe_read, .write = rpc_pipe_write, .poll = rpc_pipe_poll, .unlocked_ioctl = rpc_pipe_ioctl, .open = rpc_pipe_open, .release = rpc_pipe_release, }; static int rpc_show_info(struct seq_file *m, void *v) { struct rpc_clnt *clnt = m->private; rcu_read_lock(); seq_printf(m, "RPC server: %s\n", rcu_dereference(clnt->cl_xprt)->servername); seq_printf(m, "service: %s (%d) version %d\n", clnt->cl_program->name, clnt->cl_prog, clnt->cl_vers); seq_printf(m, "address: %s\n", rpc_peeraddr2str(clnt, RPC_DISPLAY_ADDR)); seq_printf(m, "protocol: %s\n", rpc_peeraddr2str(clnt, RPC_DISPLAY_PROTO)); seq_printf(m, "port: %s\n", rpc_peeraddr2str(clnt, RPC_DISPLAY_PORT)); rcu_read_unlock(); return 0; } static int rpc_info_open(struct inode *inode, struct file *file) { struct rpc_clnt *clnt = NULL; int ret = single_open(file, rpc_show_info, NULL); if (!ret) { struct seq_file *m = file->private_data; spin_lock(&file->f_path.dentry->d_lock); if (!d_unhashed(file->f_path.dentry)) clnt = RPC_I(inode)->private; if (clnt != NULL && refcount_inc_not_zero(&clnt->cl_count)) { spin_unlock(&file->f_path.dentry->d_lock); m->private = clnt; } else { spin_unlock(&file->f_path.dentry->d_lock); single_release(inode, file); ret = -EINVAL; } } return ret; } static int rpc_info_release(struct inode *inode, struct file *file) { struct seq_file *m = file->private_data; struct rpc_clnt *clnt = (struct rpc_clnt *)m->private; if (clnt) rpc_release_client(clnt); return single_release(inode, file); } static const struct file_operations rpc_info_operations = { .owner = THIS_MODULE, .open = rpc_info_open, .read = seq_read, .llseek = seq_lseek, .release = rpc_info_release, }; /* * Description of fs contents. */ struct rpc_filelist { const char *name; const struct file_operations *i_fop; umode_t mode; }; static struct inode * rpc_get_inode(struct super_block *sb, umode_t mode) { struct inode *inode = new_inode(sb); if (!inode) return NULL; inode->i_ino = get_next_ino(); inode->i_mode = mode; simple_inode_init_ts(inode); switch (mode & S_IFMT) { case S_IFDIR: inode->i_fop = &simple_dir_operations; inode->i_op = &simple_dir_inode_operations; inc_nlink(inode); break; default: break; } return inode; } static void init_pipe(struct rpc_pipe *pipe) { pipe->nreaders = 0; pipe->nwriters = 0; INIT_LIST_HEAD(&pipe->in_upcall); INIT_LIST_HEAD(&pipe->in_downcall); INIT_LIST_HEAD(&pipe->pipe); pipe->pipelen = 0; INIT_DELAYED_WORK(&pipe->queue_timeout, rpc_timeout_upcall_queue); pipe->ops = NULL; spin_lock_init(&pipe->lock); pipe->dentry = NULL; } void rpc_destroy_pipe_data(struct rpc_pipe *pipe) { kfree(pipe); } EXPORT_SYMBOL_GPL(rpc_destroy_pipe_data); struct rpc_pipe *rpc_mkpipe_data(const struct rpc_pipe_ops *ops, int flags) { struct rpc_pipe *pipe; pipe = kzalloc(sizeof(struct rpc_pipe), GFP_KERNEL); if (!pipe) return ERR_PTR(-ENOMEM); init_pipe(pipe); pipe->ops = ops; pipe->flags = flags; return pipe; } EXPORT_SYMBOL_GPL(rpc_mkpipe_data); static int rpc_new_file(struct dentry *parent, const char *name, umode_t mode, const struct file_operations *i_fop, void *private) { struct dentry *dentry = simple_start_creating(parent, name); struct inode *dir = parent->d_inode; struct inode *inode; if (IS_ERR(dentry)) return PTR_ERR(dentry); inode = rpc_get_inode(dir->i_sb, S_IFREG | mode); if (unlikely(!inode)) { dput(dentry); inode_unlock(dir); return -ENOMEM; } inode->i_ino = iunique(dir->i_sb, 100); if (i_fop) inode->i_fop = i_fop; rpc_inode_setowner(inode, private); d_instantiate(dentry, inode); fsnotify_create(dir, dentry); inode_unlock(dir); return 0; } static struct dentry *rpc_new_dir(struct dentry *parent, const char *name, umode_t mode) { struct dentry *dentry = simple_start_creating(parent, name); struct inode *dir = parent->d_inode; struct inode *inode; if (IS_ERR(dentry)) return dentry; inode = rpc_get_inode(dir->i_sb, S_IFDIR | mode); if (unlikely(!inode)) { dput(dentry); inode_unlock(dir); return ERR_PTR(-ENOMEM); } inode->i_ino = iunique(dir->i_sb, 100); inc_nlink(dir); d_instantiate(dentry, inode); fsnotify_mkdir(dir, dentry); inode_unlock(dir); return dentry; } static int rpc_populate(struct dentry *parent, const struct rpc_filelist *files, int start, int eof, void *private) { struct dentry *dentry; int i, err; for (i = start; i < eof; i++) { switch (files[i].mode & S_IFMT) { default: BUG(); case S_IFREG: err = rpc_new_file(parent, files[i].name, files[i].mode, files[i].i_fop, private); if (err) goto out_bad; break; case S_IFDIR: dentry = rpc_new_dir(parent, files[i].name, files[i].mode); if (IS_ERR(dentry)) { err = PTR_ERR(dentry); goto out_bad; } } } return 0; out_bad: printk(KERN_WARNING "%s: %s failed to populate directory %pd\n", __FILE__, __func__, parent); return err; } /** * rpc_mkpipe_dentry - make an rpc_pipefs file for kernel<->userspace * communication * @parent: dentry of directory to create new "pipe" in * @name: name of pipe * @private: private data to associate with the pipe, for the caller's use * @pipe: &rpc_pipe containing input parameters * * Data is made available for userspace to read by calls to * rpc_queue_upcall(). The actual reads will result in calls to * @ops->upcall, which will be called with the file pointer, * message, and userspace buffer to copy to. * * Writes can come at any time, and do not necessarily have to be * responses to upcalls. They will result in calls to @msg->downcall. * * The @private argument passed here will be available to all these methods * from the file pointer, via RPC_I(file_inode(file))->private. */ int rpc_mkpipe_dentry(struct dentry *parent, const char *name, void *private, struct rpc_pipe *pipe) { struct inode *dir = d_inode(parent); struct dentry *dentry; struct inode *inode; struct rpc_inode *rpci; umode_t umode = S_IFIFO | 0600; int err; if (pipe->ops->upcall == NULL) umode &= ~0444; if (pipe->ops->downcall == NULL) umode &= ~0222; dentry = simple_start_creating(parent, name); if (IS_ERR(dentry)) { err = PTR_ERR(dentry); goto failed; } inode = rpc_get_inode(dir->i_sb, umode); if (unlikely(!inode)) { dput(dentry); inode_unlock(dir); err = -ENOMEM; goto failed; } inode->i_ino = iunique(dir->i_sb, 100); inode->i_fop = &rpc_pipe_fops; rpci = RPC_I(inode); rpci->private = private; rpci->pipe = pipe; rpc_inode_setowner(inode, private); d_instantiate(dentry, inode); pipe->dentry = dentry; fsnotify_create(dir, dentry); inode_unlock(dir); return 0; failed: pr_warn("%s() failed to create pipe %pd/%s (errno = %d)\n", __func__, parent, name, err); return err; } EXPORT_SYMBOL_GPL(rpc_mkpipe_dentry); /** * rpc_unlink - remove a pipe * @pipe: the pipe to be removed * * After this call, lookups will no longer find the pipe, and any * attempts to read or write using preexisting opens of the pipe will * return -EPIPE. */ void rpc_unlink(struct rpc_pipe *pipe) { if (pipe->dentry) { simple_recursive_removal(pipe->dentry, rpc_close_pipes); pipe->dentry = NULL; } } EXPORT_SYMBOL_GPL(rpc_unlink); /** * rpc_init_pipe_dir_head - initialise a struct rpc_pipe_dir_head * @pdh: pointer to struct rpc_pipe_dir_head */ void rpc_init_pipe_dir_head(struct rpc_pipe_dir_head *pdh) { INIT_LIST_HEAD(&pdh->pdh_entries); pdh->pdh_dentry = NULL; } EXPORT_SYMBOL_GPL(rpc_init_pipe_dir_head); /** * rpc_init_pipe_dir_object - initialise a struct rpc_pipe_dir_object * @pdo: pointer to struct rpc_pipe_dir_object * @pdo_ops: pointer to const struct rpc_pipe_dir_object_ops * @pdo_data: pointer to caller-defined data */ void rpc_init_pipe_dir_object(struct rpc_pipe_dir_object *pdo, const struct rpc_pipe_dir_object_ops *pdo_ops, void *pdo_data) { INIT_LIST_HEAD(&pdo->pdo_head); pdo->pdo_ops = pdo_ops; pdo->pdo_data = pdo_data; } EXPORT_SYMBOL_GPL(rpc_init_pipe_dir_object); static int rpc_add_pipe_dir_object_locked(struct net *net, struct rpc_pipe_dir_head *pdh, struct rpc_pipe_dir_object *pdo) { int ret = 0; if (pdh->pdh_dentry) ret = pdo->pdo_ops->create(pdh->pdh_dentry, pdo); if (ret == 0) list_add_tail(&pdo->pdo_head, &pdh->pdh_entries); return ret; } static void rpc_remove_pipe_dir_object_locked(struct net *net, struct rpc_pipe_dir_head *pdh, struct rpc_pipe_dir_object *pdo) { if (pdh->pdh_dentry) pdo->pdo_ops->destroy(pdh->pdh_dentry, pdo); list_del_init(&pdo->pdo_head); } /** * rpc_add_pipe_dir_object - associate a rpc_pipe_dir_object to a directory * @net: pointer to struct net * @pdh: pointer to struct rpc_pipe_dir_head * @pdo: pointer to struct rpc_pipe_dir_object * */ int rpc_add_pipe_dir_object(struct net *net, struct rpc_pipe_dir_head *pdh, struct rpc_pipe_dir_object *pdo) { int ret = 0; if (list_empty(&pdo->pdo_head)) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); mutex_lock(&sn->pipefs_sb_lock); ret = rpc_add_pipe_dir_object_locked(net, pdh, pdo); mutex_unlock(&sn->pipefs_sb_lock); } return ret; } EXPORT_SYMBOL_GPL(rpc_add_pipe_dir_object); /** * rpc_remove_pipe_dir_object - remove a rpc_pipe_dir_object from a directory * @net: pointer to struct net * @pdh: pointer to struct rpc_pipe_dir_head * @pdo: pointer to struct rpc_pipe_dir_object * */ void rpc_remove_pipe_dir_object(struct net *net, struct rpc_pipe_dir_head *pdh, struct rpc_pipe_dir_object *pdo) { if (!list_empty(&pdo->pdo_head)) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); mutex_lock(&sn->pipefs_sb_lock); rpc_remove_pipe_dir_object_locked(net, pdh, pdo); mutex_unlock(&sn->pipefs_sb_lock); } } EXPORT_SYMBOL_GPL(rpc_remove_pipe_dir_object); /** * rpc_find_or_alloc_pipe_dir_object * @net: pointer to struct net * @pdh: pointer to struct rpc_pipe_dir_head * @match: match struct rpc_pipe_dir_object to data * @alloc: allocate a new struct rpc_pipe_dir_object * @data: user defined data for match() and alloc() * */ struct rpc_pipe_dir_object * rpc_find_or_alloc_pipe_dir_object(struct net *net, struct rpc_pipe_dir_head *pdh, int (*match)(struct rpc_pipe_dir_object *, void *), struct rpc_pipe_dir_object *(*alloc)(void *), void *data) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); struct rpc_pipe_dir_object *pdo; mutex_lock(&sn->pipefs_sb_lock); list_for_each_entry(pdo, &pdh->pdh_entries, pdo_head) { if (!match(pdo, data)) continue; goto out; } pdo = alloc(data); if (!pdo) goto out; rpc_add_pipe_dir_object_locked(net, pdh, pdo); out: mutex_unlock(&sn->pipefs_sb_lock); return pdo; } EXPORT_SYMBOL_GPL(rpc_find_or_alloc_pipe_dir_object); static void rpc_create_pipe_dir_objects(struct rpc_pipe_dir_head *pdh) { struct rpc_pipe_dir_object *pdo; struct dentry *dir = pdh->pdh_dentry; list_for_each_entry(pdo, &pdh->pdh_entries, pdo_head) pdo->pdo_ops->create(dir, pdo); } static void rpc_destroy_pipe_dir_objects(struct rpc_pipe_dir_head *pdh) { struct rpc_pipe_dir_object *pdo; struct dentry *dir = pdh->pdh_dentry; list_for_each_entry(pdo, &pdh->pdh_entries, pdo_head) pdo->pdo_ops->destroy(dir, pdo); } /** * rpc_create_client_dir - Create a new rpc_client directory in rpc_pipefs * @dentry: the parent of new directory * @name: the name of new directory * @rpc_client: rpc client to associate with this directory * * This creates a directory at the given @path associated with * @rpc_clnt, which will contain a file named "info" with some basic * information about the client, together with any "pipes" that may * later be created using rpc_mkpipe(). */ int rpc_create_client_dir(struct dentry *dentry, const char *name, struct rpc_clnt *rpc_client) { struct dentry *ret; int err; ret = rpc_new_dir(dentry, name, 0555); if (IS_ERR(ret)) return PTR_ERR(ret); err = rpc_new_file(ret, "info", S_IFREG | 0400, &rpc_info_operations, rpc_client); if (err) { pr_warn("%s failed to populate directory %pd\n", __func__, ret); simple_recursive_removal(ret, NULL); return err; } rpc_client->cl_pipedir_objects.pdh_dentry = ret; rpc_create_pipe_dir_objects(&rpc_client->cl_pipedir_objects); return 0; } /** * rpc_remove_client_dir - Remove a directory created with rpc_create_client_dir() * @rpc_client: rpc_client for the pipe */ int rpc_remove_client_dir(struct rpc_clnt *rpc_client) { struct dentry *dentry = rpc_client->cl_pipedir_objects.pdh_dentry; if (dentry == NULL) return 0; rpc_destroy_pipe_dir_objects(&rpc_client->cl_pipedir_objects); rpc_client->cl_pipedir_objects.pdh_dentry = NULL; simple_recursive_removal(dentry, NULL); return 0; } static const struct rpc_filelist cache_pipefs_files[3] = { [0] = { .name = "channel", .i_fop = &cache_file_operations_pipefs, .mode = S_IFREG | 0600, }, [1] = { .name = "content", .i_fop = &content_file_operations_pipefs, .mode = S_IFREG | 0400, }, [2] = { .name = "flush", .i_fop = &cache_flush_operations_pipefs, .mode = S_IFREG | 0600, }, }; struct dentry *rpc_create_cache_dir(struct dentry *parent, const char *name, umode_t umode, struct cache_detail *cd) { struct dentry *dentry; dentry = rpc_new_dir(parent, name, umode); if (!IS_ERR(dentry)) { int error = rpc_populate(dentry, cache_pipefs_files, 0, 3, cd); if (error) { simple_recursive_removal(dentry, NULL); return ERR_PTR(error); } } return dentry; } void rpc_remove_cache_dir(struct dentry *dentry) { simple_recursive_removal(dentry, NULL); } /* * populate the filesystem */ static const struct super_operations s_ops = { .alloc_inode = rpc_alloc_inode, .free_inode = rpc_free_inode, .statfs = simple_statfs, }; #define RPCAUTH_GSSMAGIC 0x67596969 /* * We have a single directory with 1 node in it. */ enum { RPCAUTH_lockd, RPCAUTH_mount, RPCAUTH_nfs, RPCAUTH_portmap, RPCAUTH_statd, RPCAUTH_nfsd4_cb, RPCAUTH_cache, RPCAUTH_nfsd, RPCAUTH_RootEOF }; static const struct rpc_filelist files[] = { [RPCAUTH_lockd] = { .name = "lockd", .mode = S_IFDIR | 0555, }, [RPCAUTH_mount] = { .name = "mount", .mode = S_IFDIR | 0555, }, [RPCAUTH_nfs] = { .name = "nfs", .mode = S_IFDIR | 0555, }, [RPCAUTH_portmap] = { .name = "portmap", .mode = S_IFDIR | 0555, }, [RPCAUTH_statd] = { .name = "statd", .mode = S_IFDIR | 0555, }, [RPCAUTH_nfsd4_cb] = { .name = "nfsd4_cb", .mode = S_IFDIR | 0555, }, [RPCAUTH_cache] = { .name = "cache", .mode = S_IFDIR | 0555, }, [RPCAUTH_nfsd] = { .name = "nfsd", .mode = S_IFDIR | 0555, }, }; /* * This call can be used only in RPC pipefs mount notification hooks. */ struct dentry *rpc_d_lookup_sb(const struct super_block *sb, const unsigned char *dir_name) { return try_lookup_noperm(&QSTR(dir_name), sb->s_root); } EXPORT_SYMBOL_GPL(rpc_d_lookup_sb); int rpc_pipefs_init_net(struct net *net) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); sn->gssd_dummy = rpc_mkpipe_data(&gssd_dummy_pipe_ops, 0); if (IS_ERR(sn->gssd_dummy)) return PTR_ERR(sn->gssd_dummy); mutex_init(&sn->pipefs_sb_lock); sn->pipe_version = -1; return 0; } void rpc_pipefs_exit_net(struct net *net) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); rpc_destroy_pipe_data(sn->gssd_dummy); } /* * This call will be used for per network namespace operations calls. * Note: Function will be returned with pipefs_sb_lock taken if superblock was * found. This lock have to be released by rpc_put_sb_net() when all operations * will be completed. */ struct super_block *rpc_get_sb_net(const struct net *net) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); mutex_lock(&sn->pipefs_sb_lock); if (sn->pipefs_sb) return sn->pipefs_sb; mutex_unlock(&sn->pipefs_sb_lock); return NULL; } EXPORT_SYMBOL_GPL(rpc_get_sb_net); void rpc_put_sb_net(const struct net *net) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); WARN_ON(sn->pipefs_sb == NULL); mutex_unlock(&sn->pipefs_sb_lock); } EXPORT_SYMBOL_GPL(rpc_put_sb_net); static ssize_t dummy_downcall(struct file *filp, const char __user *src, size_t len) { return -EINVAL; } static const struct rpc_pipe_ops gssd_dummy_pipe_ops = { .upcall = rpc_pipe_generic_upcall, .downcall = dummy_downcall, }; /* * Here we present a bogus "info" file to keep rpc.gssd happy. We don't expect * that it will ever use this info to handle an upcall, but rpc.gssd expects * that this file will be there and have a certain format. */ static int rpc_dummy_info_show(struct seq_file *m, void *v) { seq_printf(m, "RPC server: %s\n", utsname()->nodename); seq_printf(m, "service: foo (1) version 0\n"); seq_printf(m, "address: 127.0.0.1\n"); seq_printf(m, "protocol: tcp\n"); seq_printf(m, "port: 0\n"); return 0; } DEFINE_SHOW_ATTRIBUTE(rpc_dummy_info); /** * rpc_gssd_dummy_populate - create a dummy gssd pipe * @root: root of the rpc_pipefs filesystem * @pipe_data: pipe data created when netns is initialized * * Create a dummy set of directories and a pipe that gssd can hold open to * indicate that it is up and running. */ static int rpc_gssd_dummy_populate(struct dentry *root, struct rpc_pipe *pipe_data) { struct dentry *gssd_dentry, *clnt_dentry; int err; gssd_dentry = rpc_new_dir(root, "gssd", 0555); if (IS_ERR(gssd_dentry)) return -ENOENT; clnt_dentry = rpc_new_dir(gssd_dentry, "clntXX", 0555); if (IS_ERR(clnt_dentry)) return -ENOENT; err = rpc_new_file(clnt_dentry, "info", 0400, &rpc_dummy_info_fops, NULL); if (!err) err = rpc_mkpipe_dentry(clnt_dentry, "gssd", NULL, pipe_data); return err; } static int rpc_fill_super(struct super_block *sb, struct fs_context *fc) { struct inode *inode; struct dentry *root; struct net *net = sb->s_fs_info; struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); int err; sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; sb->s_magic = RPCAUTH_GSSMAGIC; sb->s_op = &s_ops; sb->s_d_flags = DCACHE_DONTCACHE; sb->s_time_gran = 1; inode = rpc_get_inode(sb, S_IFDIR | 0555); sb->s_root = root = d_make_root(inode); if (!root) return -ENOMEM; if (rpc_populate(root, files, RPCAUTH_lockd, RPCAUTH_RootEOF, NULL)) return -ENOMEM; err = rpc_gssd_dummy_populate(root, sn->gssd_dummy); if (err) return err; dprintk("RPC: sending pipefs MOUNT notification for net %x%s\n", net->ns.inum, NET_NAME(net)); mutex_lock(&sn->pipefs_sb_lock); sn->pipefs_sb = sb; err = blocking_notifier_call_chain(&rpc_pipefs_notifier_list, RPC_PIPEFS_MOUNT, sb); mutex_unlock(&sn->pipefs_sb_lock); return err; } bool gssd_running(struct net *net) { struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); struct rpc_pipe *pipe = sn->gssd_dummy; return pipe->nreaders || pipe->nwriters; } EXPORT_SYMBOL_GPL(gssd_running); static int rpc_fs_get_tree(struct fs_context *fc) { return get_tree_keyed(fc, rpc_fill_super, get_net(fc->net_ns)); } static void rpc_fs_free_fc(struct fs_context *fc) { if (fc->s_fs_info) put_net(fc->s_fs_info); } static const struct fs_context_operations rpc_fs_context_ops = { .free = rpc_fs_free_fc, .get_tree = rpc_fs_get_tree, }; static int rpc_init_fs_context(struct fs_context *fc) { put_user_ns(fc->user_ns); fc->user_ns = get_user_ns(fc->net_ns->user_ns); fc->ops = &rpc_fs_context_ops; return 0; } static void rpc_kill_sb(struct super_block *sb) { struct net *net = sb->s_fs_info; struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); mutex_lock(&sn->pipefs_sb_lock); if (sn->pipefs_sb != sb) { mutex_unlock(&sn->pipefs_sb_lock); goto out; } sn->pipefs_sb = NULL; dprintk("RPC: sending pipefs UMOUNT notification for net %x%s\n", net->ns.inum, NET_NAME(net)); blocking_notifier_call_chain(&rpc_pipefs_notifier_list, RPC_PIPEFS_UMOUNT, sb); mutex_unlock(&sn->pipefs_sb_lock); out: kill_litter_super(sb); put_net(net); } static struct file_system_type rpc_pipe_fs_type = { .owner = THIS_MODULE, .name = "rpc_pipefs", .init_fs_context = rpc_init_fs_context, .kill_sb = rpc_kill_sb, }; MODULE_ALIAS_FS("rpc_pipefs"); MODULE_ALIAS("rpc_pipefs"); static void init_once(void *foo) { struct rpc_inode *rpci = (struct rpc_inode *) foo; inode_init_once(&rpci->vfs_inode); rpci->private = NULL; rpci->pipe = NULL; init_waitqueue_head(&rpci->waitq); } int register_rpc_pipefs(void) { int err; rpc_inode_cachep = kmem_cache_create("rpc_inode_cache", sizeof(struct rpc_inode), 0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT| SLAB_ACCOUNT), init_once); if (!rpc_inode_cachep) return -ENOMEM; err = rpc_clients_notifier_register(); if (err) goto err_notifier; err = register_filesystem(&rpc_pipe_fs_type); if (err) goto err_register; return 0; err_register: rpc_clients_notifier_unregister(); err_notifier: kmem_cache_destroy(rpc_inode_cachep); return err; } void unregister_rpc_pipefs(void) { rpc_clients_notifier_unregister(); unregister_filesystem(&rpc_pipe_fs_type); kmem_cache_destroy(rpc_inode_cachep); } |
| 197 196 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (c) 2013 Red Hat, Inc. and Parallels Inc. All rights reserved. * Authors: David Chinner and Glauber Costa * * Generic LRU infrastructure */ #ifndef _LRU_LIST_H #define _LRU_LIST_H #include <linux/list.h> #include <linux/nodemask.h> #include <linux/shrinker.h> #include <linux/xarray.h> struct mem_cgroup; /* list_lru_walk_cb has to always return one of those */ enum lru_status { LRU_REMOVED, /* item removed from list */ LRU_REMOVED_RETRY, /* item removed, but lock has been dropped and reacquired */ LRU_ROTATE, /* item referenced, give another pass */ LRU_SKIP, /* item cannot be locked, skip */ LRU_RETRY, /* item not freeable. May drop the lock internally, but has to return locked. */ LRU_STOP, /* stop lru list walking. May drop the lock internally, but has to return locked. */ }; struct list_lru_one { struct list_head list; /* may become negative during memcg reparenting */ long nr_items; /* protects all fields above */ spinlock_t lock; }; struct list_lru_memcg { struct rcu_head rcu; /* array of per cgroup per node lists, indexed by node id */ struct list_lru_one node[]; }; struct list_lru_node { /* global list, used for the root cgroup in cgroup aware lrus */ struct list_lru_one lru; atomic_long_t nr_items; } ____cacheline_aligned_in_smp; struct list_lru { struct list_lru_node *node; #ifdef CONFIG_MEMCG struct list_head list; int shrinker_id; bool memcg_aware; struct xarray xa; #endif #ifdef CONFIG_LOCKDEP struct lock_class_key *key; #endif }; void list_lru_destroy(struct list_lru *lru); int __list_lru_init(struct list_lru *lru, bool memcg_aware, struct shrinker *shrinker); #define list_lru_init(lru) \ __list_lru_init((lru), false, NULL) #define list_lru_init_memcg(lru, shrinker) \ __list_lru_init((lru), true, shrinker) static inline int list_lru_init_memcg_key(struct list_lru *lru, struct shrinker *shrinker, struct lock_class_key *key) { #ifdef CONFIG_LOCKDEP lru->key = key; #endif return list_lru_init_memcg(lru, shrinker); } int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru, gfp_t gfp); void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent); /** * list_lru_add: add an element to the lru list's tail * @lru: the lru pointer * @item: the item to be added. * @nid: the node id of the sublist to add the item to. * @memcg: the cgroup of the sublist to add the item to. * * If the element is already part of a list, this function returns doing * nothing. This means that it is not necessary to keep state about whether or * not the element already belongs in the list. That said, this logic only * works if the item is in *this* list. If the item might be in some other * list, then you cannot rely on this check and you must remove it from the * other list before trying to insert it. * * The lru list consists of many sublists internally; the @nid and @memcg * parameters are used to determine which sublist to insert the item into. * It's important to use the right value of @nid and @memcg when deleting the * item, since it might otherwise get deleted from the wrong sublist. * * This also applies when attempting to insert the item multiple times - if * the item is currently in one sublist and you call list_lru_add() again, you * must pass the right @nid and @memcg parameters so that the same sublist is * used. * * You must ensure that the memcg is not freed during this call (e.g., with * rcu or by taking a css refcnt). * * Return: true if the list was updated, false otherwise */ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid, struct mem_cgroup *memcg); /** * list_lru_add_obj: add an element to the lru list's tail * @lru: the lru pointer * @item: the item to be added. * * This function is similar to list_lru_add(), but the NUMA node and the * memcg of the sublist is determined by @item list_head. This assumption is * valid for slab objects LRU such as dentries, inodes, etc. * * Return: true if the list was updated, false otherwise */ bool list_lru_add_obj(struct list_lru *lru, struct list_head *item); /** * list_lru_del: delete an element from the lru list * @lru: the lru pointer * @item: the item to be deleted. * @nid: the node id of the sublist to delete the item from. * @memcg: the cgroup of the sublist to delete the item from. * * This function works analogously as list_lru_add() in terms of list * manipulation. * * The comments in list_lru_add() about an element already being in a list are * also valid for list_lru_del(), that is, you can delete an item that has * already been removed or never been added. However, if the item is in a * list, it must be in *this* list, and you must pass the right value of @nid * and @memcg so that the right sublist is used. * * You must ensure that the memcg is not freed during this call (e.g., with * rcu or by taking a css refcnt). When a memcg is deleted, list_lru entries * are automatically moved to the parent memcg. This is done in a race-free * way, so during deletion of an memcg both the old and new memcg will resolve * to the same sublist internally. * * Return: true if the list was updated, false otherwise */ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid, struct mem_cgroup *memcg); /** * list_lru_del_obj: delete an element from the lru list * @lru: the lru pointer * @item: the item to be deleted. * * This function is similar to list_lru_del(), but the NUMA node and the * memcg of the sublist is determined by @item list_head. This assumption is * valid for slab objects LRU such as dentries, inodes, etc. * * Return: true if the list was updated, false otherwise. */ bool list_lru_del_obj(struct list_lru *lru, struct list_head *item); /** * list_lru_count_one: return the number of objects currently held by @lru * @lru: the lru pointer. * @nid: the node id to count from. * @memcg: the cgroup to count from. * * There is no guarantee that the list is not updated while the count is being * computed. Callers that want such a guarantee need to provide an outer lock. * * Return: 0 for empty lists, otherwise the number of objects * currently held by @lru. */ unsigned long list_lru_count_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg); unsigned long list_lru_count_node(struct list_lru *lru, int nid); static inline unsigned long list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc) { return list_lru_count_one(lru, sc->nid, sc->memcg); } static inline unsigned long list_lru_count(struct list_lru *lru) { long count = 0; int nid; for_each_node_state(nid, N_NORMAL_MEMORY) count += list_lru_count_node(lru, nid); return count; } void list_lru_isolate(struct list_lru_one *list, struct list_head *item); void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item, struct list_head *head); typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, struct list_lru_one *list, void *cb_arg); /** * list_lru_walk_one: walk a @lru, isolating and disposing freeable items. * @lru: the lru pointer. * @nid: the node id to scan from. * @memcg: the cgroup to scan from. * @isolate: callback function that is responsible for deciding what to do with * the item currently being scanned * @cb_arg: opaque type that will be passed to @isolate * @nr_to_walk: how many items to scan. * * This function will scan all elements in a particular @lru, calling the * @isolate callback for each of those items, along with the current list * spinlock and a caller-provided opaque. The @isolate callback can choose to * drop the lock internally, but *must* return with the lock held. The callback * will return an enum lru_status telling the @lru infrastructure what to * do with the object being scanned. * * Please note that @nr_to_walk does not mean how many objects will be freed, * just how many objects will be scanned. * * Return: the number of objects effectively removed from the LRU. */ unsigned long list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk); /** * list_lru_walk_one_irq: walk a @lru, isolating and disposing freeable items. * @lru: the lru pointer. * @nid: the node id to scan from. * @memcg: the cgroup to scan from. * @isolate: callback function that is responsible for deciding what to do with * the item currently being scanned * @cb_arg: opaque type that will be passed to @isolate * @nr_to_walk: how many items to scan. * * Same as list_lru_walk_one() except that the spinlock is acquired with * spin_lock_irq(). */ unsigned long list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk); unsigned long list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk); static inline unsigned long list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc, list_lru_walk_cb isolate, void *cb_arg) { return list_lru_walk_one(lru, sc->nid, sc->memcg, isolate, cb_arg, &sc->nr_to_scan); } static inline unsigned long list_lru_shrink_walk_irq(struct list_lru *lru, struct shrink_control *sc, list_lru_walk_cb isolate, void *cb_arg) { return list_lru_walk_one_irq(lru, sc->nid, sc->memcg, isolate, cb_arg, &sc->nr_to_scan); } static inline unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate, void *cb_arg, unsigned long nr_to_walk) { long isolated = 0; int nid; for_each_node_state(nid, N_NORMAL_MEMORY) { isolated += list_lru_walk_node(lru, nid, isolate, cb_arg, &nr_to_walk); if (nr_to_walk <= 0) break; } return isolated; } #endif /* _LRU_LIST_H */ |
| 27 14 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | // SPDX-License-Identifier: GPL-2.0 #include <linux/bcd.h> #include <linux/export.h> unsigned _bcd2bin(unsigned char val) { return (val & 0x0f) + (val >> 4) * 10; } EXPORT_SYMBOL(_bcd2bin); unsigned char _bin2bcd(unsigned val) { const unsigned int t = (val * 103) >> 10; return (t << 4) | (val - t * 10); } EXPORT_SYMBOL(_bin2bcd); |
| 62 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * NET An implementation of the SOCKET network access protocol. * This is the master header file for the Linux NET layer, * or, in plain English: the networking handling part of the * kernel. * * Version: @(#)net.h 1.0.3 05/25/93 * * Authors: Orest Zborowski, <obz@Kodak.COM> * Ross Biro * Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG> */ #ifndef _LINUX_NET_H #define _LINUX_NET_H #include <linux/stringify.h> #include <linux/random.h> #include <linux/wait.h> #include <linux/fcntl.h> /* For O_CLOEXEC and O_NONBLOCK */ #include <linux/rcupdate.h> #include <linux/once.h> #include <linux/fs.h> #include <linux/mm.h> #include <linux/sockptr.h> #include <uapi/linux/net.h> struct poll_table_struct; struct pipe_inode_info; struct inode; struct file; struct net; /* Historically, SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA were located * in sock->flags, but moved into sk->sk_wq->flags to be RCU protected. * Eventually all flags will be in sk->sk_wq->flags. */ enum socket_flags { SOCKWQ_ASYNC_NOSPACE, SOCKWQ_ASYNC_WAITDATA, SOCK_NOSPACE, SOCK_SUPPORT_ZC, SOCK_CUSTOM_SOCKOPT, }; #ifndef ARCH_HAS_SOCKET_TYPES /** * enum sock_type - Socket types * @SOCK_STREAM: stream (connection) socket * @SOCK_DGRAM: datagram (conn.less) socket * @SOCK_RAW: raw socket * @SOCK_RDM: reliably-delivered message * @SOCK_SEQPACKET: sequential packet socket * @SOCK_DCCP: Datagram Congestion Control Protocol socket * @SOCK_PACKET: linux specific way of getting packets at the dev level. * For writing rarp and other similar things on the user level. * * When adding some new socket type please * grep ARCH_HAS_SOCKET_TYPE include/asm-* /socket.h, at least MIPS * overrides this enum for binary compat reasons. */ enum sock_type { SOCK_STREAM = 1, SOCK_DGRAM = 2, SOCK_RAW = 3, SOCK_RDM = 4, SOCK_SEQPACKET = 5, SOCK_DCCP = 6, SOCK_PACKET = 10, }; #endif /* ARCH_HAS_SOCKET_TYPES */ #define SOCK_MAX (SOCK_PACKET + 1) /* Mask which covers at least up to SOCK_MASK-1. The * remaining bits are used as flags. */ #define SOCK_TYPE_MASK 0xf /* Flags for socket, socketpair, accept4 */ #define SOCK_CLOEXEC O_CLOEXEC #ifndef SOCK_NONBLOCK #define SOCK_NONBLOCK O_NONBLOCK #endif #define SOCK_COREDUMP O_NOCTTY /** * enum sock_shutdown_cmd - Shutdown types * @SHUT_RD: shutdown receptions * @SHUT_WR: shutdown transmissions * @SHUT_RDWR: shutdown receptions/transmissions */ enum sock_shutdown_cmd { SHUT_RD, SHUT_WR, SHUT_RDWR, }; struct socket_wq { /* Note: wait MUST be first field of socket_wq */ wait_queue_head_t wait; struct fasync_struct *fasync_list; unsigned long flags; /* %SOCKWQ_ASYNC_NOSPACE, etc */ struct rcu_head rcu; } ____cacheline_aligned_in_smp; /** * struct socket - general BSD socket * @state: socket state (%SS_CONNECTED, etc) * @type: socket type (%SOCK_STREAM, etc) * @flags: socket flags (%SOCK_NOSPACE, etc) * @ops: protocol specific socket operations * @file: File back pointer for gc * @sk: internal networking protocol agnostic socket representation * @wq: wait queue for several uses */ struct socket { socket_state state; short type; unsigned long flags; struct file *file; struct sock *sk; const struct proto_ops *ops; /* Might change with IPV6_ADDRFORM or MPTCP. */ struct socket_wq wq; }; /* * "descriptor" for what we're up to with a read. * This allows us to use the same read code yet * have multiple different users of the data that * we read from a file. * * The simplest case just copies the data to user * mode. */ typedef struct { size_t written; size_t count; union { char __user *buf; void *data; } arg; int error; } read_descriptor_t; struct vm_area_struct; struct page; struct sockaddr; struct msghdr; struct module; struct sk_buff; struct proto_accept_arg; typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *, unsigned int, size_t); typedef int (*skb_read_actor_t)(struct sock *, struct sk_buff *); struct proto_ops { int family; struct module *owner; int (*release) (struct socket *sock); int (*bind) (struct socket *sock, struct sockaddr *myaddr, int sockaddr_len); int (*connect) (struct socket *sock, struct sockaddr *vaddr, int sockaddr_len, int flags); int (*socketpair)(struct socket *sock1, struct socket *sock2); int (*accept) (struct socket *sock, struct socket *newsock, struct proto_accept_arg *arg); int (*getname) (struct socket *sock, struct sockaddr *addr, int peer); __poll_t (*poll) (struct file *file, struct socket *sock, struct poll_table_struct *wait); int (*ioctl) (struct socket *sock, unsigned int cmd, unsigned long arg); #ifdef CONFIG_COMPAT int (*compat_ioctl) (struct socket *sock, unsigned int cmd, unsigned long arg); #endif int (*gettstamp) (struct socket *sock, void __user *userstamp, bool timeval, bool time32); int (*listen) (struct socket *sock, int len); int (*shutdown) (struct socket *sock, int flags); int (*setsockopt)(struct socket *sock, int level, int optname, sockptr_t optval, unsigned int optlen); int (*getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen); void (*show_fdinfo)(struct seq_file *m, struct socket *sock); int (*sendmsg) (struct socket *sock, struct msghdr *m, size_t total_len); /* Notes for implementing recvmsg: * =============================== * msg->msg_namelen should get updated by the recvmsg handlers * iff msg_name != NULL. It is by default 0 to prevent * returning uninitialized memory to user space. The recvfrom * handlers can assume that msg.msg_name is either NULL or has * a minimum size of sizeof(struct sockaddr_storage). */ int (*recvmsg) (struct socket *sock, struct msghdr *m, size_t total_len, int flags); int (*mmap) (struct file *file, struct socket *sock, struct vm_area_struct * vma); ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags); void (*splice_eof)(struct socket *sock); int (*set_peek_off)(struct sock *sk, int val); int (*peek_len)(struct socket *sock); /* The following functions are called internally by kernel with * sock lock already held. */ int (*read_sock)(struct sock *sk, read_descriptor_t *desc, sk_read_actor_t recv_actor); /* This is different from read_sock(), it reads an entire skb at a time. */ int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor); int (*sendmsg_locked)(struct sock *sk, struct msghdr *msg, size_t size); int (*set_rcvlowat)(struct sock *sk, int val); }; #define DECLARE_SOCKADDR(type, dst, src) \ type dst = ({ __sockaddr_check_size(sizeof(*dst)); (type) src; }) struct net_proto_family { int family; int (*create)(struct net *net, struct socket *sock, int protocol, int kern); struct module *owner; }; struct iovec; struct kvec; enum { SOCK_WAKE_IO, SOCK_WAKE_WAITD, SOCK_WAKE_SPACE, SOCK_WAKE_URG, }; int sock_wake_async(struct socket_wq *sk_wq, int how, int band); int sock_register(const struct net_proto_family *fam); void sock_unregister(int family); bool sock_is_registered(int family); int __sock_create(struct net *net, int family, int type, int proto, struct socket **res, int kern); int sock_create(int family, int type, int proto, struct socket **res); int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res); int sock_create_lite(int family, int type, int proto, struct socket **res); struct socket *sock_alloc(void); void sock_release(struct socket *sock); int sock_sendmsg(struct socket *sock, struct msghdr *msg); int sock_recvmsg(struct socket *sock, struct msghdr *msg, int flags); struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname); struct socket *sockfd_lookup(int fd, int *err); struct socket *sock_from_file(struct file *file); #define sockfd_put(sock) fput(sock->file) int net_ratelimit(void); #define net_ratelimited_function(function, ...) \ do { \ if (net_ratelimit()) \ function(__VA_ARGS__); \ } while (0) #define net_emerg_ratelimited(fmt, ...) \ net_ratelimited_function(pr_emerg, fmt, ##__VA_ARGS__) #define net_alert_ratelimited(fmt, ...) \ net_ratelimited_function(pr_alert, fmt, ##__VA_ARGS__) #define net_crit_ratelimited(fmt, ...) \ net_ratelimited_function(pr_crit, fmt, ##__VA_ARGS__) #define net_err_ratelimited(fmt, ...) \ net_ratelimited_function(pr_err, fmt, ##__VA_ARGS__) #define net_notice_ratelimited(fmt, ...) \ net_ratelimited_function(pr_notice, fmt, ##__VA_ARGS__) #define net_warn_ratelimited(fmt, ...) \ net_ratelimited_function(pr_warn, fmt, ##__VA_ARGS__) #define net_info_ratelimited(fmt, ...) \ net_ratelimited_function(pr_info, fmt, ##__VA_ARGS__) #if defined(CONFIG_DYNAMIC_DEBUG) || \ (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE)) #define net_dbg_ratelimited(fmt, ...) \ do { \ DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \ if (DYNAMIC_DEBUG_BRANCH(descriptor) && \ net_ratelimit()) \ __dynamic_pr_debug(&descriptor, pr_fmt(fmt), \ ##__VA_ARGS__); \ } while (0) #elif defined(DEBUG) #define net_dbg_ratelimited(fmt, ...) \ net_ratelimited_function(pr_debug, fmt, ##__VA_ARGS__) #else #define net_dbg_ratelimited(fmt, ...) \ no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__) #endif #define net_get_random_once(buf, nbytes) \ get_random_once((buf), (nbytes)) /* * E.g. XFS meta- & log-data is in slab pages, or bcache meta * data pages, or other high order pages allocated by * __get_free_pages() without __GFP_COMP, which have a page_count * of 0 and/or have PageSlab() set. We cannot use send_page for * those, as that does get_page(); put_page(); and would cause * either a VM_BUG directly, or __page_cache_release a page that * would actually still be referenced by someone, leading to some * obscure delayed Oops somewhere else. */ static inline bool sendpage_ok(struct page *page) { return !PageSlab(page) && page_count(page) >= 1; } /* * Check sendpage_ok on contiguous pages. */ static inline bool sendpages_ok(struct page *page, size_t len, size_t offset) { struct page *p = page + (offset >> PAGE_SHIFT); size_t count = 0; while (count < len) { if (!sendpage_ok(p)) return false; p++; count += PAGE_SIZE; } return true; } int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec, size_t num, size_t len); int kernel_recvmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec, size_t num, size_t len, int flags); int kernel_bind(struct socket *sock, struct sockaddr *addr, int addrlen); int kernel_listen(struct socket *sock, int backlog); int kernel_accept(struct socket *sock, struct socket **newsock, int flags); int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen, int flags); int kernel_getsockname(struct socket *sock, struct sockaddr *addr); int kernel_getpeername(struct socket *sock, struct sockaddr *addr); int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how); /* Routine returns the IP overhead imposed by a (caller-protected) socket. */ u32 kernel_sock_ip_overhead(struct sock *sk); #define MODULE_ALIAS_NETPROTO(proto) \ MODULE_ALIAS("net-pf-" __stringify(proto)) #define MODULE_ALIAS_NET_PF_PROTO(pf, proto) \ MODULE_ALIAS("net-pf-" __stringify(pf) "-proto-" __stringify(proto)) #define MODULE_ALIAS_NET_PF_PROTO_TYPE(pf, proto, type) \ MODULE_ALIAS("net-pf-" __stringify(pf) "-proto-" __stringify(proto) \ "-type-" __stringify(type)) #define MODULE_ALIAS_NET_PF_PROTO_NAME(pf, proto, name) \ MODULE_ALIAS("net-pf-" __stringify(pf) "-proto-" __stringify(proto) \ name) #endif /* _LINUX_NET_H */ |
| 17 3 16 14 16 2 15 2 15 16 2 15 22 22 21 21 21 21 21 6 3 2 3 1 9 2 4 4 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 | // SPDX-License-Identifier: GPL-2.0-only #include <net/xdp_sock_drv.h> #include "netlink.h" #include "common.h" struct channels_req_info { struct ethnl_req_info base; }; struct channels_reply_data { struct ethnl_reply_data base; struct ethtool_channels channels; }; #define CHANNELS_REPDATA(__reply_base) \ container_of(__reply_base, struct channels_reply_data, base) const struct nla_policy ethnl_channels_get_policy[] = { [ETHTOOL_A_CHANNELS_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), }; static int channels_prepare_data(const struct ethnl_req_info *req_base, struct ethnl_reply_data *reply_base, const struct genl_info *info) { struct channels_reply_data *data = CHANNELS_REPDATA(reply_base); struct net_device *dev = reply_base->dev; int ret; if (!dev->ethtool_ops->get_channels) return -EOPNOTSUPP; ret = ethnl_ops_begin(dev); if (ret < 0) return ret; dev->ethtool_ops->get_channels(dev, &data->channels); ethnl_ops_complete(dev); return 0; } static int channels_reply_size(const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { return nla_total_size(sizeof(u32)) + /* _CHANNELS_RX_MAX */ nla_total_size(sizeof(u32)) + /* _CHANNELS_TX_MAX */ nla_total_size(sizeof(u32)) + /* _CHANNELS_OTHER_MAX */ nla_total_size(sizeof(u32)) + /* _CHANNELS_COMBINED_MAX */ nla_total_size(sizeof(u32)) + /* _CHANNELS_RX_COUNT */ nla_total_size(sizeof(u32)) + /* _CHANNELS_TX_COUNT */ nla_total_size(sizeof(u32)) + /* _CHANNELS_OTHER_COUNT */ nla_total_size(sizeof(u32)); /* _CHANNELS_COMBINED_COUNT */ } static int channels_fill_reply(struct sk_buff *skb, const struct ethnl_req_info *req_base, const struct ethnl_reply_data *reply_base) { const struct channels_reply_data *data = CHANNELS_REPDATA(reply_base); const struct ethtool_channels *channels = &data->channels; if ((channels->max_rx && (nla_put_u32(skb, ETHTOOL_A_CHANNELS_RX_MAX, channels->max_rx) || nla_put_u32(skb, ETHTOOL_A_CHANNELS_RX_COUNT, channels->rx_count))) || (channels->max_tx && (nla_put_u32(skb, ETHTOOL_A_CHANNELS_TX_MAX, channels->max_tx) || nla_put_u32(skb, ETHTOOL_A_CHANNELS_TX_COUNT, channels->tx_count))) || (channels->max_other && (nla_put_u32(skb, ETHTOOL_A_CHANNELS_OTHER_MAX, channels->max_other) || nla_put_u32(skb, ETHTOOL_A_CHANNELS_OTHER_COUNT, channels->other_count))) || (channels->max_combined && (nla_put_u32(skb, ETHTOOL_A_CHANNELS_COMBINED_MAX, channels->max_combined) || nla_put_u32(skb, ETHTOOL_A_CHANNELS_COMBINED_COUNT, channels->combined_count)))) return -EMSGSIZE; return 0; } /* CHANNELS_SET */ const struct nla_policy ethnl_channels_set_policy[] = { [ETHTOOL_A_CHANNELS_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), [ETHTOOL_A_CHANNELS_RX_COUNT] = { .type = NLA_U32 }, [ETHTOOL_A_CHANNELS_TX_COUNT] = { .type = NLA_U32 }, [ETHTOOL_A_CHANNELS_OTHER_COUNT] = { .type = NLA_U32 }, [ETHTOOL_A_CHANNELS_COMBINED_COUNT] = { .type = NLA_U32 }, }; static int ethnl_set_channels_validate(struct ethnl_req_info *req_info, struct genl_info *info) { const struct ethtool_ops *ops = req_info->dev->ethtool_ops; return ops->get_channels && ops->set_channels ? 1 : -EOPNOTSUPP; } static int ethnl_set_channels(struct ethnl_req_info *req_info, struct genl_info *info) { unsigned int from_channel, old_total, i; bool mod = false, mod_combined = false; struct net_device *dev = req_info->dev; struct ethtool_channels channels = {}; struct nlattr **tb = info->attrs; u32 err_attr; int ret; dev->ethtool_ops->get_channels(dev, &channels); old_total = channels.combined_count + max(channels.rx_count, channels.tx_count); ethnl_update_u32(&channels.rx_count, tb[ETHTOOL_A_CHANNELS_RX_COUNT], &mod); ethnl_update_u32(&channels.tx_count, tb[ETHTOOL_A_CHANNELS_TX_COUNT], &mod); ethnl_update_u32(&channels.other_count, tb[ETHTOOL_A_CHANNELS_OTHER_COUNT], &mod); ethnl_update_u32(&channels.combined_count, tb[ETHTOOL_A_CHANNELS_COMBINED_COUNT], &mod_combined); mod |= mod_combined; if (!mod) return 0; /* ensure new channel counts are within limits */ if (channels.rx_count > channels.max_rx) err_attr = ETHTOOL_A_CHANNELS_RX_COUNT; else if (channels.tx_count > channels.max_tx) err_attr = ETHTOOL_A_CHANNELS_TX_COUNT; else if (channels.other_count > channels.max_other) err_attr = ETHTOOL_A_CHANNELS_OTHER_COUNT; else if (channels.combined_count > channels.max_combined) err_attr = ETHTOOL_A_CHANNELS_COMBINED_COUNT; else err_attr = 0; if (err_attr) { NL_SET_ERR_MSG_ATTR(info->extack, tb[err_attr], "requested channel count exceeds maximum"); return -EINVAL; } /* ensure there is at least one RX and one TX channel */ if (!channels.combined_count && !channels.rx_count) err_attr = ETHTOOL_A_CHANNELS_RX_COUNT; else if (!channels.combined_count && !channels.tx_count) err_attr = ETHTOOL_A_CHANNELS_TX_COUNT; else err_attr = 0; if (err_attr) { if (mod_combined) err_attr = ETHTOOL_A_CHANNELS_COMBINED_COUNT; NL_SET_ERR_MSG_ATTR(info->extack, tb[err_attr], "requested channel counts would result in no RX or TX channel being configured"); return -EINVAL; } ret = ethtool_check_max_channel(dev, channels, info); if (ret) return ret; /* Disabling channels, query zero-copy AF_XDP sockets */ from_channel = channels.combined_count + min(channels.rx_count, channels.tx_count); for (i = from_channel; i < old_total; i++) if (xsk_get_pool_from_qid(dev, i)) { GENL_SET_ERR_MSG(info, "requested channel counts are too low for existing zerocopy AF_XDP sockets"); return -EINVAL; } ret = dev->ethtool_ops->set_channels(dev, &channels); return ret < 0 ? ret : 1; } const struct ethnl_request_ops ethnl_channels_request_ops = { .request_cmd = ETHTOOL_MSG_CHANNELS_GET, .reply_cmd = ETHTOOL_MSG_CHANNELS_GET_REPLY, .hdr_attr = ETHTOOL_A_CHANNELS_HEADER, .req_info_size = sizeof(struct channels_req_info), .reply_data_size = sizeof(struct channels_reply_data), .prepare_data = channels_prepare_data, .reply_size = channels_reply_size, .fill_reply = channels_fill_reply, .set_validate = ethnl_set_channels_validate, .set = ethnl_set_channels, .set_ntf_cmd = ETHTOOL_MSG_CHANNELS_NTF, }; |
| 39 15 24 24 39 1 47 5 42 1 3 3 2 51 1 1 1 48 39 3 39 39 39 39 15 15 15 15 2 2 2 2 2 2 2 2 22 22 22 22 22 22 22 22 2 2 2 2 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 | // SPDX-License-Identifier: GPL-2.0-only /* * net/sched/sch_mqprio.c * * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com> */ #include <linux/ethtool_netlink.h> #include <linux/types.h> #include <linux/slab.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/errno.h> #include <linux/skbuff.h> #include <linux/module.h> #include <net/netlink.h> #include <net/pkt_sched.h> #include <net/sch_generic.h> #include <net/pkt_cls.h> #include "sch_mqprio_lib.h" struct mqprio_sched { struct Qdisc **qdiscs; u16 mode; u16 shaper; int hw_offload; u32 flags; u64 min_rate[TC_QOPT_MAX_QUEUE]; u64 max_rate[TC_QOPT_MAX_QUEUE]; u32 fp[TC_QOPT_MAX_QUEUE]; }; static int mqprio_enable_offload(struct Qdisc *sch, const struct tc_mqprio_qopt *qopt, struct netlink_ext_ack *extack) { struct mqprio_sched *priv = qdisc_priv(sch); struct net_device *dev = qdisc_dev(sch); struct tc_mqprio_qopt_offload mqprio = { .qopt = *qopt, .extack = extack, }; int err, i; switch (priv->mode) { case TC_MQPRIO_MODE_DCB: if (priv->shaper != TC_MQPRIO_SHAPER_DCB) return -EINVAL; break; case TC_MQPRIO_MODE_CHANNEL: mqprio.flags = priv->flags; if (priv->flags & TC_MQPRIO_F_MODE) mqprio.mode = priv->mode; if (priv->flags & TC_MQPRIO_F_SHAPER) mqprio.shaper = priv->shaper; if (priv->flags & TC_MQPRIO_F_MIN_RATE) for (i = 0; i < mqprio.qopt.num_tc; i++) mqprio.min_rate[i] = priv->min_rate[i]; if (priv->flags & TC_MQPRIO_F_MAX_RATE) for (i = 0; i < mqprio.qopt.num_tc; i++) mqprio.max_rate[i] = priv->max_rate[i]; break; default: return -EINVAL; } mqprio_fp_to_offload(priv->fp, &mqprio); err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_MQPRIO, &mqprio); if (err) return err; priv->hw_offload = mqprio.qopt.hw; return 0; } static void mqprio_disable_offload(struct Qdisc *sch) { struct tc_mqprio_qopt_offload mqprio = { { 0 } }; struct mqprio_sched *priv = qdisc_priv(sch); struct net_device *dev = qdisc_dev(sch); switch (priv->mode) { case TC_MQPRIO_MODE_DCB: case TC_MQPRIO_MODE_CHANNEL: dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_MQPRIO, &mqprio); break; } } static void mqprio_destroy(struct Qdisc *sch) { struct net_device *dev = qdisc_dev(sch); struct mqprio_sched *priv = qdisc_priv(sch); unsigned int ntx; if (priv->qdiscs) { for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++) qdisc_put(priv->qdiscs[ntx]); kfree(priv->qdiscs); } if (priv->hw_offload && dev->netdev_ops->ndo_setup_tc) mqprio_disable_offload(sch); else netdev_set_num_tc(dev, 0); } static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt, const struct tc_mqprio_caps *caps, struct netlink_ext_ack *extack) { int err; /* Limit qopt->hw to maximum supported offload value. Drivers have * the option of overriding this later if they don't support the a * given offload type. */ if (qopt->hw > TC_MQPRIO_HW_OFFLOAD_MAX) qopt->hw = TC_MQPRIO_HW_OFFLOAD_MAX; /* If hardware offload is requested, we will leave 3 options to the * device driver: * - populate the queue counts itself (and ignore what was requested) * - validate the provided queue counts by itself (and apply them) * - request queue count validation here (and apply them) */ err = mqprio_validate_qopt(dev, qopt, !qopt->hw || caps->validate_queue_counts, false, extack); if (err) return err; /* If ndo_setup_tc is not present then hardware doesn't support offload * and we should return an error. */ if (qopt->hw && !dev->netdev_ops->ndo_setup_tc) { NL_SET_ERR_MSG(extack, "Device does not support hardware offload"); return -EINVAL; } return 0; } static const struct nla_policy mqprio_tc_entry_policy[TCA_MQPRIO_TC_ENTRY_MAX + 1] = { [TCA_MQPRIO_TC_ENTRY_INDEX] = NLA_POLICY_MAX(NLA_U32, TC_QOPT_MAX_QUEUE - 1), [TCA_MQPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32, TC_FP_EXPRESS, TC_FP_PREEMPTIBLE), }; static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = { [TCA_MQPRIO_MODE] = { .len = sizeof(u16) }, [TCA_MQPRIO_SHAPER] = { .len = sizeof(u16) }, [TCA_MQPRIO_MIN_RATE64] = { .type = NLA_NESTED }, [TCA_MQPRIO_MAX_RATE64] = { .type = NLA_NESTED }, [TCA_MQPRIO_TC_ENTRY] = { .type = NLA_NESTED }, }; static int mqprio_parse_tc_entry(u32 fp[TC_QOPT_MAX_QUEUE], struct nlattr *opt, unsigned long *seen_tcs, struct netlink_ext_ack *extack) { struct nlattr *tb[TCA_MQPRIO_TC_ENTRY_MAX + 1]; int err, tc; err = nla_parse_nested(tb, TCA_MQPRIO_TC_ENTRY_MAX, opt, mqprio_tc_entry_policy, extack); if (err < 0) return err; if (NL_REQ_ATTR_CHECK(extack, opt, tb, TCA_MQPRIO_TC_ENTRY_INDEX)) { NL_SET_ERR_MSG(extack, "TC entry index missing"); return -EINVAL; } tc = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_INDEX]); if (*seen_tcs & BIT(tc)) { NL_SET_ERR_MSG_ATTR(extack, tb[TCA_MQPRIO_TC_ENTRY_INDEX], "Duplicate tc entry"); return -EINVAL; } *seen_tcs |= BIT(tc); if (tb[TCA_MQPRIO_TC_ENTRY_FP]) fp[tc] = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_FP]); return 0; } static int mqprio_parse_tc_entries(struct Qdisc *sch, struct nlattr *nlattr_opt, int nlattr_opt_len, struct netlink_ext_ack *extack) { struct mqprio_sched *priv = qdisc_priv(sch); struct net_device *dev = qdisc_dev(sch); bool have_preemption = false; unsigned long seen_tcs = 0; u32 fp[TC_QOPT_MAX_QUEUE]; struct nlattr *n; int tc, rem; int err = 0; for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) fp[tc] = priv->fp[tc]; nla_for_each_attr_type(n, TCA_MQPRIO_TC_ENTRY, nlattr_opt, nlattr_opt_len, rem) { err = mqprio_parse_tc_entry(fp, n, &seen_tcs, extack); if (err) goto out; } for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) { priv->fp[tc] = fp[tc]; if (fp[tc] == TC_FP_PREEMPTIBLE) have_preemption = true; } if (have_preemption && !ethtool_dev_mm_supported(dev)) { NL_SET_ERR_MSG(extack, "Device does not support preemption"); return -EOPNOTSUPP; } out: return err; } /* Parse the other netlink attributes that represent the payload of * TCA_OPTIONS, which are appended right after struct tc_mqprio_qopt. */ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt, struct nlattr *opt, struct netlink_ext_ack *extack) { struct nlattr *nlattr_opt = nla_data(opt) + NLA_ALIGN(sizeof(*qopt)); int nlattr_opt_len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt)); struct mqprio_sched *priv = qdisc_priv(sch); struct nlattr *tb[TCA_MQPRIO_MAX + 1] = {}; struct nlattr *attr; int i, rem, err; if (nlattr_opt_len >= nla_attr_size(0)) { err = nla_parse_deprecated(tb, TCA_MQPRIO_MAX, nlattr_opt, nlattr_opt_len, mqprio_policy, NULL); if (err < 0) return err; } if (!qopt->hw) { NL_SET_ERR_MSG(extack, "mqprio TCA_OPTIONS can only contain netlink attributes in hardware mode"); return -EINVAL; } if (tb[TCA_MQPRIO_MODE]) { priv->flags |= TC_MQPRIO_F_MODE; priv->mode = nla_get_u16(tb[TCA_MQPRIO_MODE]); } if (tb[TCA_MQPRIO_SHAPER]) { priv->flags |= TC_MQPRIO_F_SHAPER; priv->shaper = nla_get_u16(tb[TCA_MQPRIO_SHAPER]); } if (tb[TCA_MQPRIO_MIN_RATE64]) { if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE) { NL_SET_ERR_MSG_ATTR(extack, tb[TCA_MQPRIO_MIN_RATE64], "min_rate accepted only when shaper is in bw_rlimit mode"); return -EINVAL; } i = 0; nla_for_each_nested(attr, tb[TCA_MQPRIO_MIN_RATE64], rem) { if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64) { NL_SET_ERR_MSG_ATTR(extack, attr, "Attribute type expected to be TCA_MQPRIO_MIN_RATE64"); return -EINVAL; } if (nla_len(attr) != sizeof(u64)) { NL_SET_ERR_MSG_ATTR(extack, attr, "Attribute TCA_MQPRIO_MIN_RATE64 expected to have 8 bytes length"); return -EINVAL; } if (i >= qopt->num_tc) break; priv->min_rate[i] = nla_get_u64(attr); i++; } priv->flags |= TC_MQPRIO_F_MIN_RATE; } if (tb[TCA_MQPRIO_MAX_RATE64]) { if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE) { NL_SET_ERR_MSG_ATTR(extack, tb[TCA_MQPRIO_MAX_RATE64], "max_rate accepted only when shaper is in bw_rlimit mode"); return -EINVAL; } i = 0; nla_for_each_nested(attr, tb[TCA_MQPRIO_MAX_RATE64], rem) { if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64) { NL_SET_ERR_MSG_ATTR(extack, attr, "Attribute type expected to be TCA_MQPRIO_MAX_RATE64"); return -EINVAL; } if (nla_len(attr) != sizeof(u64)) { NL_SET_ERR_MSG_ATTR(extack, attr, "Attribute TCA_MQPRIO_MAX_RATE64 expected to have 8 bytes length"); return -EINVAL; } if (i >= qopt->num_tc) break; priv->max_rate[i] = nla_get_u64(attr); i++; } priv->flags |= TC_MQPRIO_F_MAX_RATE; } if (tb[TCA_MQPRIO_TC_ENTRY]) { err = mqprio_parse_tc_entries(sch, nlattr_opt, nlattr_opt_len, extack); if (err) return err; } return 0; } static int mqprio_init(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct net_device *dev = qdisc_dev(sch); struct mqprio_sched *priv = qdisc_priv(sch); struct netdev_queue *dev_queue; struct Qdisc *qdisc; int i, err = -EOPNOTSUPP; struct tc_mqprio_qopt *qopt = NULL; struct tc_mqprio_caps caps; int len, tc; BUILD_BUG_ON(TC_MAX_QUEUE != TC_QOPT_MAX_QUEUE); BUILD_BUG_ON(TC_BITMASK != TC_QOPT_BITMASK); if (sch->parent != TC_H_ROOT) return -EOPNOTSUPP; if (!netif_is_multiqueue(dev)) return -EOPNOTSUPP; /* make certain can allocate enough classids to handle queues */ if (dev->num_tx_queues >= TC_H_MIN_PRIORITY) return -ENOMEM; if (!opt || nla_len(opt) < sizeof(*qopt)) return -EINVAL; for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) priv->fp[tc] = TC_FP_EXPRESS; qdisc_offload_query_caps(dev, TC_SETUP_QDISC_MQPRIO, &caps, sizeof(caps)); qopt = nla_data(opt); if (mqprio_parse_opt(dev, qopt, &caps, extack)) return -EINVAL; len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt)); if (len > 0) { err = mqprio_parse_nlattr(sch, qopt, opt, extack); if (err) return err; } /* pre-allocate qdisc, attachment can't fail */ priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]), GFP_KERNEL); if (!priv->qdiscs) return -ENOMEM; for (i = 0; i < dev->num_tx_queues; i++) { dev_queue = netdev_get_tx_queue(dev, i); qdisc = qdisc_create_dflt(dev_queue, get_default_qdisc_ops(dev, i), TC_H_MAKE(TC_H_MAJ(sch->handle), TC_H_MIN(i + 1)), extack); if (!qdisc) return -ENOMEM; priv->qdiscs[i] = qdisc; qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT; } /* If the mqprio options indicate that hardware should own * the queue mapping then run ndo_setup_tc otherwise use the * supplied and verified mapping */ if (qopt->hw) { err = mqprio_enable_offload(sch, qopt, extack); if (err) return err; } else { netdev_set_num_tc(dev, qopt->num_tc); for (i = 0; i < qopt->num_tc; i++) netdev_set_tc_queue(dev, i, qopt->count[i], qopt->offset[i]); } /* Always use supplied priority mappings */ for (i = 0; i < TC_BITMASK + 1; i++) netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i]); sch->flags |= TCQ_F_MQROOT; return 0; } static void mqprio_attach(struct Qdisc *sch) { struct net_device *dev = qdisc_dev(sch); struct mqprio_sched *priv = qdisc_priv(sch); struct Qdisc *qdisc, *old; unsigned int ntx; /* Attach underlying qdisc */ for (ntx = 0; ntx < dev->num_tx_queues; ntx++) { qdisc = priv->qdiscs[ntx]; old = dev_graft_qdisc(qdisc->dev_queue, qdisc); if (old) qdisc_put(old); if (ntx < dev->real_num_tx_queues) qdisc_hash_add(qdisc, false); } kfree(priv->qdiscs); priv->qdiscs = NULL; } static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch, unsigned long cl) { struct net_device *dev = qdisc_dev(sch); unsigned long ntx = cl - 1; if (ntx >= dev->num_tx_queues) return NULL; return netdev_get_tx_queue(dev, ntx); } static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new, struct Qdisc **old, struct netlink_ext_ack *extack) { struct net_device *dev = qdisc_dev(sch); struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl); if (!dev_queue) return -EINVAL; if (dev->flags & IFF_UP) dev_deactivate(dev); *old = dev_graft_qdisc(dev_queue, new); if (new) new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT; if (dev->flags & IFF_UP) dev_activate(dev); return 0; } static int dump_rates(struct mqprio_sched *priv, struct tc_mqprio_qopt *opt, struct sk_buff *skb) { struct nlattr *nest; int i; if (priv->flags & TC_MQPRIO_F_MIN_RATE) { nest = nla_nest_start_noflag(skb, TCA_MQPRIO_MIN_RATE64); if (!nest) goto nla_put_failure; for (i = 0; i < opt->num_tc; i++) { if (nla_put(skb, TCA_MQPRIO_MIN_RATE64, sizeof(priv->min_rate[i]), &priv->min_rate[i])) goto nla_put_failure; } nla_nest_end(skb, nest); } if (priv->flags & TC_MQPRIO_F_MAX_RATE) { nest = nla_nest_start_noflag(skb, TCA_MQPRIO_MAX_RATE64); if (!nest) goto nla_put_failure; for (i = 0; i < opt->num_tc; i++) { if (nla_put(skb, TCA_MQPRIO_MAX_RATE64, sizeof(priv->max_rate[i]), &priv->max_rate[i])) goto nla_put_failure; } nla_nest_end(skb, nest); } return 0; nla_put_failure: nla_nest_cancel(skb, nest); return -1; } static int mqprio_dump_tc_entries(struct mqprio_sched *priv, struct sk_buff *skb) { struct nlattr *n; int tc; for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) { n = nla_nest_start(skb, TCA_MQPRIO_TC_ENTRY); if (!n) return -EMSGSIZE; if (nla_put_u32(skb, TCA_MQPRIO_TC_ENTRY_INDEX, tc)) goto nla_put_failure; if (nla_put_u32(skb, TCA_MQPRIO_TC_ENTRY_FP, priv->fp[tc])) goto nla_put_failure; nla_nest_end(skb, n); } return 0; nla_put_failure: nla_nest_cancel(skb, n); return -EMSGSIZE; } static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb) { struct net_device *dev = qdisc_dev(sch); struct mqprio_sched *priv = qdisc_priv(sch); struct nlattr *nla = (struct nlattr *)skb_tail_pointer(skb); struct tc_mqprio_qopt opt = { 0 }; struct Qdisc *qdisc; unsigned int ntx; sch->q.qlen = 0; gnet_stats_basic_sync_init(&sch->bstats); memset(&sch->qstats, 0, sizeof(sch->qstats)); /* MQ supports lockless qdiscs. However, statistics accounting needs * to account for all, none, or a mix of locked and unlocked child * qdiscs. Percpu stats are added to counters in-band and locking * qdisc totals are added at end. */ for (ntx = 0; ntx < dev->num_tx_queues; ntx++) { qdisc = rtnl_dereference(netdev_get_tx_queue(dev, ntx)->qdisc_sleeping); spin_lock_bh(qdisc_lock(qdisc)); gnet_stats_add_basic(&sch->bstats, qdisc->cpu_bstats, &qdisc->bstats, false); gnet_stats_add_queue(&sch->qstats, qdisc->cpu_qstats, &qdisc->qstats); sch->q.qlen += qdisc_qlen(qdisc); spin_unlock_bh(qdisc_lock(qdisc)); } mqprio_qopt_reconstruct(dev, &opt); opt.hw = priv->hw_offload; if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt)) goto nla_put_failure; if ((priv->flags & TC_MQPRIO_F_MODE) && nla_put_u16(skb, TCA_MQPRIO_MODE, priv->mode)) goto nla_put_failure; if ((priv->flags & TC_MQPRIO_F_SHAPER) && nla_put_u16(skb, TCA_MQPRIO_SHAPER, priv->shaper)) goto nla_put_failure; if ((priv->flags & TC_MQPRIO_F_MIN_RATE || priv->flags & TC_MQPRIO_F_MAX_RATE) && (dump_rates(priv, &opt, skb) != 0)) goto nla_put_failure; if (mqprio_dump_tc_entries(priv, skb)) goto nla_put_failure; return nla_nest_end(skb, nla); nla_put_failure: nlmsg_trim(skb, nla); return -1; } static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl) { struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl); if (!dev_queue) return NULL; return rtnl_dereference(dev_queue->qdisc_sleeping); } static unsigned long mqprio_find(struct Qdisc *sch, u32 classid) { struct net_device *dev = qdisc_dev(sch); unsigned int ntx = TC_H_MIN(classid); /* There are essentially two regions here that have valid classid * values. The first region will have a classid value of 1 through * num_tx_queues. All of these are backed by actual Qdiscs. */ if (ntx < TC_H_MIN_PRIORITY) return (ntx <= dev->num_tx_queues) ? ntx : 0; /* The second region represents the hardware traffic classes. These * are represented by classid values of TC_H_MIN_PRIORITY through * TC_H_MIN_PRIORITY + netdev_get_num_tc - 1 */ return ((ntx - TC_H_MIN_PRIORITY) < netdev_get_num_tc(dev)) ? ntx : 0; } static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl, struct sk_buff *skb, struct tcmsg *tcm) { if (cl < TC_H_MIN_PRIORITY) { struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl); struct net_device *dev = qdisc_dev(sch); int tc = netdev_txq_to_tc(dev, cl - 1); tcm->tcm_parent = (tc < 0) ? 0 : TC_H_MAKE(TC_H_MAJ(sch->handle), TC_H_MIN(tc + TC_H_MIN_PRIORITY)); tcm->tcm_info = rtnl_dereference(dev_queue->qdisc_sleeping)->handle; } else { tcm->tcm_parent = TC_H_ROOT; tcm->tcm_info = 0; } tcm->tcm_handle |= TC_H_MIN(cl); return 0; } static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl, struct gnet_dump *d) __releases(d->lock) __acquires(d->lock) { if (cl >= TC_H_MIN_PRIORITY) { int i; __u32 qlen; struct gnet_stats_queue qstats = {0}; struct gnet_stats_basic_sync bstats; struct net_device *dev = qdisc_dev(sch); struct netdev_tc_txq tc = dev->tc_to_txq[cl & TC_BITMASK]; gnet_stats_basic_sync_init(&bstats); /* Drop lock here it will be reclaimed before touching * statistics this is required because the d->lock we * hold here is the look on dev_queue->qdisc_sleeping * also acquired below. */ if (d->lock) spin_unlock_bh(d->lock); for (i = tc.offset; i < tc.offset + tc.count; i++) { struct netdev_queue *q = netdev_get_tx_queue(dev, i); struct Qdisc *qdisc = rtnl_dereference(q->qdisc); spin_lock_bh(qdisc_lock(qdisc)); gnet_stats_add_basic(&bstats, qdisc->cpu_bstats, &qdisc->bstats, false); gnet_stats_add_queue(&qstats, qdisc->cpu_qstats, &qdisc->qstats); sch->q.qlen += qdisc_qlen(qdisc); spin_unlock_bh(qdisc_lock(qdisc)); } qlen = qdisc_qlen(sch) + qstats.qlen; /* Reclaim root sleeping lock before completing stats */ if (d->lock) spin_lock_bh(d->lock); if (gnet_stats_copy_basic(d, NULL, &bstats, false) < 0 || gnet_stats_copy_queue(d, NULL, &qstats, qlen) < 0) return -1; } else { struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl); sch = rtnl_dereference(dev_queue->qdisc_sleeping); if (gnet_stats_copy_basic(d, sch->cpu_bstats, &sch->bstats, true) < 0 || qdisc_qstats_copy(d, sch) < 0) return -1; } return 0; } static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg) { struct net_device *dev = qdisc_dev(sch); unsigned long ntx; if (arg->stop) return; /* Walk hierarchy with a virtual class per tc */ arg->count = arg->skip; for (ntx = arg->skip; ntx < netdev_get_num_tc(dev); ntx++) { if (!tc_qdisc_stats_dump(sch, ntx + TC_H_MIN_PRIORITY, arg)) return; } /* Pad the values and skip over unused traffic classes */ if (ntx < TC_MAX_QUEUE) { arg->count = TC_MAX_QUEUE; ntx = TC_MAX_QUEUE; } /* Reset offset, sort out remaining per-queue qdiscs */ for (ntx -= TC_MAX_QUEUE; ntx < dev->num_tx_queues; ntx++) { if (arg->fn(sch, ntx + 1, arg) < 0) { arg->stop = 1; return; } arg->count++; } } static struct netdev_queue *mqprio_select_queue(struct Qdisc *sch, struct tcmsg *tcm) { return mqprio_queue_get(sch, TC_H_MIN(tcm->tcm_parent)); } static const struct Qdisc_class_ops mqprio_class_ops = { .graft = mqprio_graft, .leaf = mqprio_leaf, .find = mqprio_find, .walk = mqprio_walk, .dump = mqprio_dump_class, .dump_stats = mqprio_dump_class_stats, .select_queue = mqprio_select_queue, }; static struct Qdisc_ops mqprio_qdisc_ops __read_mostly = { .cl_ops = &mqprio_class_ops, .id = "mqprio", .priv_size = sizeof(struct mqprio_sched), .init = mqprio_init, .destroy = mqprio_destroy, .attach = mqprio_attach, .change_real_num_tx = mq_change_real_num_tx, .dump = mqprio_dump, .owner = THIS_MODULE, }; MODULE_ALIAS_NET_SCH("mqprio"); static int __init mqprio_module_init(void) { return register_qdisc(&mqprio_qdisc_ops); } static void __exit mqprio_module_exit(void) { unregister_qdisc(&mqprio_qdisc_ops); } module_init(mqprio_module_init); module_exit(mqprio_module_exit); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("Classful multiqueue prio qdisc"); |
| 99 164 250 414 7 4 3 403 2 242 354 354 354 350 4 164 8 181 1 7 344 2 350 3172 1516 1736 231 1 1 230 3 1 2 2 47 30 1 16 1 16 7 7 5413 2370 1634 1783 4 4 25 21 2 13 21 4 1 1 1 2 4 2 1 2 2121 2 55 55 109 109 10 99 5 1 3 1 5 1 2 2 10 2 2 6 3 1 1 1 169 1 1 167 116 18 2 1 13 2 91 2 20 18 56 70 6 1 27 51 44 39 31 31 22 4 2 18 29 1 35 37 97 2 97 30 14 2 17 34 34 34 32 1 31 4 89 1 88 77 77 77 48 50 2 1 48 12 33 2 31 30 11 5 3 2 4 8 2 2 2 2 2 46 5 2 39 33 33 33 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 | // SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * Copyright (c) 2016,2017 Facebook */ #include <linux/bpf.h> #include <linux/btf.h> #include <linux/err.h> #include <linux/slab.h> #include <linux/mm.h> #include <linux/filter.h> #include <linux/perf_event.h> #include <uapi/linux/btf.h> #include <linux/rcupdate_trace.h> #include <linux/btf_ids.h> #include "map_in_map.h" #define ARRAY_CREATE_FLAG_MASK \ (BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | \ BPF_F_PRESERVE_ELEMS | BPF_F_INNER_MAP) static void bpf_array_free_percpu(struct bpf_array *array) { int i; for (i = 0; i < array->map.max_entries; i++) { free_percpu(array->pptrs[i]); cond_resched(); } } static int bpf_array_alloc_percpu(struct bpf_array *array) { void __percpu *ptr; int i; for (i = 0; i < array->map.max_entries; i++) { ptr = bpf_map_alloc_percpu(&array->map, array->elem_size, 8, GFP_USER | __GFP_NOWARN); if (!ptr) { bpf_array_free_percpu(array); return -ENOMEM; } array->pptrs[i] = ptr; cond_resched(); } return 0; } /* Called from syscall */ int array_map_alloc_check(union bpf_attr *attr) { bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; int numa_node = bpf_map_attr_numa_node(attr); /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 4 || attr->value_size == 0 || attr->map_flags & ~ARRAY_CREATE_FLAG_MASK || !bpf_map_flags_access_ok(attr->map_flags) || (percpu && numa_node != NUMA_NO_NODE)) return -EINVAL; if (attr->map_type != BPF_MAP_TYPE_ARRAY && attr->map_flags & (BPF_F_MMAPABLE | BPF_F_INNER_MAP)) return -EINVAL; if (attr->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY && attr->map_flags & BPF_F_PRESERVE_ELEMS) return -EINVAL; /* avoid overflow on round_up(map->value_size) */ if (attr->value_size > INT_MAX) return -E2BIG; /* percpu map value size is bound by PCPU_MIN_UNIT_SIZE */ if (percpu && round_up(attr->value_size, 8) > PCPU_MIN_UNIT_SIZE) return -E2BIG; return 0; } static struct bpf_map *array_map_alloc(union bpf_attr *attr) { bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; int numa_node = bpf_map_attr_numa_node(attr); u32 elem_size, index_mask, max_entries; bool bypass_spec_v1 = bpf_bypass_spec_v1(NULL); u64 array_size, mask64; struct bpf_array *array; elem_size = round_up(attr->value_size, 8); max_entries = attr->max_entries; /* On 32 bit archs roundup_pow_of_two() with max_entries that has * upper most bit set in u32 space is undefined behavior due to * resulting 1U << 32, so do it manually here in u64 space. */ mask64 = fls_long(max_entries - 1); mask64 = 1ULL << mask64; mask64 -= 1; index_mask = mask64; if (!bypass_spec_v1) { /* round up array size to nearest power of 2, * since cpu will speculate within index_mask limits */ max_entries = index_mask + 1; /* Check for overflows. */ if (max_entries < attr->max_entries) return ERR_PTR(-E2BIG); } array_size = sizeof(*array); if (percpu) { array_size += (u64) max_entries * sizeof(void *); } else { /* rely on vmalloc() to return page-aligned memory and * ensure array->value is exactly page-aligned */ if (attr->map_flags & BPF_F_MMAPABLE) { array_size = PAGE_ALIGN(array_size); array_size += PAGE_ALIGN((u64) max_entries * elem_size); } else { array_size += (u64) max_entries * elem_size; } } /* allocate all map elements and zero-initialize them */ if (attr->map_flags & BPF_F_MMAPABLE) { void *data; /* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */ data = bpf_map_area_mmapable_alloc(array_size, numa_node); if (!data) return ERR_PTR(-ENOMEM); array = data + PAGE_ALIGN(sizeof(struct bpf_array)) - offsetof(struct bpf_array, value); } else { array = bpf_map_area_alloc(array_size, numa_node); } if (!array) return ERR_PTR(-ENOMEM); array->index_mask = index_mask; array->map.bypass_spec_v1 = bypass_spec_v1; /* copy mandatory map attributes */ bpf_map_init_from_attr(&array->map, attr); array->elem_size = elem_size; if (percpu && bpf_array_alloc_percpu(array)) { bpf_map_area_free(array); return ERR_PTR(-ENOMEM); } return &array->map; } static void *array_map_elem_ptr(struct bpf_array* array, u32 index) { return array->value + (u64)array->elem_size * index; } /* Called from syscall or from eBPF program */ static void *array_map_lookup_elem(struct bpf_map *map, void *key) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; if (unlikely(index >= array->map.max_entries)) return NULL; return array->value + (u64)array->elem_size * (index & array->index_mask); } static int array_map_direct_value_addr(const struct bpf_map *map, u64 *imm, u32 off) { struct bpf_array *array = container_of(map, struct bpf_array, map); if (map->max_entries != 1) return -ENOTSUPP; if (off >= map->value_size) return -EINVAL; *imm = (unsigned long)array->value; return 0; } static int array_map_direct_value_meta(const struct bpf_map *map, u64 imm, u32 *off) { struct bpf_array *array = container_of(map, struct bpf_array, map); u64 base = (unsigned long)array->value; u64 range = array->elem_size; if (map->max_entries != 1) return -ENOTSUPP; if (imm < base || imm >= base + range) return -ENOENT; *off = imm - base; return 0; } /* emit BPF instructions equivalent to C code of array_map_lookup_elem() */ static int array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) { struct bpf_array *array = container_of(map, struct bpf_array, map); struct bpf_insn *insn = insn_buf; u32 elem_size = array->elem_size; const int ret = BPF_REG_0; const int map_ptr = BPF_REG_1; const int index = BPF_REG_2; if (map->map_flags & BPF_F_INNER_MAP) return -EOPNOTSUPP; *insn++ = BPF_ALU64_IMM(BPF_ADD, map_ptr, offsetof(struct bpf_array, value)); *insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0); if (!map->bypass_spec_v1) { *insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 4); *insn++ = BPF_ALU32_IMM(BPF_AND, ret, array->index_mask); } else { *insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 3); } if (is_power_of_2(elem_size)) { *insn++ = BPF_ALU64_IMM(BPF_LSH, ret, ilog2(elem_size)); } else { *insn++ = BPF_ALU64_IMM(BPF_MUL, ret, elem_size); } *insn++ = BPF_ALU64_REG(BPF_ADD, ret, map_ptr); *insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1); *insn++ = BPF_MOV64_IMM(ret, 0); return insn - insn_buf; } /* Called from eBPF program */ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; if (unlikely(index >= array->map.max_entries)) return NULL; return this_cpu_ptr(array->pptrs[index & array->index_mask]); } /* emit BPF instructions equivalent to C code of percpu_array_map_lookup_elem() */ static int percpu_array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) { struct bpf_array *array = container_of(map, struct bpf_array, map); struct bpf_insn *insn = insn_buf; if (!bpf_jit_supports_percpu_insn()) return -EOPNOTSUPP; if (map->map_flags & BPF_F_INNER_MAP) return -EOPNOTSUPP; BUILD_BUG_ON(offsetof(struct bpf_array, map) != 0); *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, offsetof(struct bpf_array, pptrs)); *insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0); if (!map->bypass_spec_v1) { *insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 6); *insn++ = BPF_ALU32_IMM(BPF_AND, BPF_REG_0, array->index_mask); } else { *insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 5); } *insn++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 3); *insn++ = BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1); *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0); *insn++ = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0); *insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1); *insn++ = BPF_MOV64_IMM(BPF_REG_0, 0); return insn - insn_buf; } static void *percpu_array_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; if (cpu >= nr_cpu_ids) return NULL; if (unlikely(index >= array->map.max_entries)) return NULL; return per_cpu_ptr(array->pptrs[index & array->index_mask], cpu); } int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; void __percpu *pptr; int cpu, off = 0; u32 size; if (unlikely(index >= array->map.max_entries)) return -ENOENT; /* per_cpu areas are zero-filled and bpf programs can only * access 'value_size' of them, so copying rounded areas * will not leak any kernel data */ size = array->elem_size; rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; for_each_possible_cpu(cpu) { copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu)); check_and_init_map_value(map, value + off); off += size; } rcu_read_unlock(); return 0; } /* Called from syscall */ static int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = key ? *(u32 *)key : U32_MAX; u32 *next = (u32 *)next_key; if (index >= array->map.max_entries) { *next = 0; return 0; } if (index == array->map.max_entries - 1) return -ENOENT; *next = index + 1; return 0; } /* Called from syscall or from eBPF program */ static long array_map_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; char *val; if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST)) /* unknown flags */ return -EINVAL; if (unlikely(index >= array->map.max_entries)) /* all elements were pre-allocated, cannot insert a new one */ return -E2BIG; if (unlikely(map_flags & BPF_NOEXIST)) /* all elements already exist */ return -EEXIST; if (unlikely((map_flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK))) return -EINVAL; if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { val = this_cpu_ptr(array->pptrs[index & array->index_mask]); copy_map_value(map, val, value); bpf_obj_free_fields(array->map.record, val); } else { val = array->value + (u64)array->elem_size * (index & array->index_mask); if (map_flags & BPF_F_LOCK) copy_map_value_locked(map, val, value, false); else copy_map_value(map, val, value); bpf_obj_free_fields(array->map.record, val); } return 0; } int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value, u64 map_flags) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; void __percpu *pptr; int cpu, off = 0; u32 size; if (unlikely(map_flags > BPF_EXIST)) /* unknown flags */ return -EINVAL; if (unlikely(index >= array->map.max_entries)) /* all elements were pre-allocated, cannot insert a new one */ return -E2BIG; if (unlikely(map_flags == BPF_NOEXIST)) /* all elements already exist */ return -EEXIST; /* the user space will provide round_up(value_size, 8) bytes that * will be copied into per-cpu area. bpf programs can only access * value_size of it. During lookup the same extra bytes will be * returned or zeros which were zero-filled by percpu_alloc, * so no kernel data leaks possible */ size = array->elem_size; rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; for_each_possible_cpu(cpu) { copy_map_value_long(map, per_cpu_ptr(pptr, cpu), value + off); bpf_obj_free_fields(array->map.record, per_cpu_ptr(pptr, cpu)); off += size; } rcu_read_unlock(); return 0; } /* Called from syscall or from eBPF program */ static long array_map_delete_elem(struct bpf_map *map, void *key) { return -EINVAL; } static void *array_map_vmalloc_addr(struct bpf_array *array) { return (void *)round_down((unsigned long)array, PAGE_SIZE); } static void array_map_free_timers_wq(struct bpf_map *map) { struct bpf_array *array = container_of(map, struct bpf_array, map); int i; /* We don't reset or free fields other than timer and workqueue * on uref dropping to zero. */ if (btf_record_has_field(map->record, BPF_TIMER | BPF_WORKQUEUE)) { for (i = 0; i < array->map.max_entries; i++) { if (btf_record_has_field(map->record, BPF_TIMER)) bpf_obj_free_timer(map->record, array_map_elem_ptr(array, i)); if (btf_record_has_field(map->record, BPF_WORKQUEUE)) bpf_obj_free_workqueue(map->record, array_map_elem_ptr(array, i)); } } } /* Called when map->refcnt goes to zero, either from workqueue or from syscall */ static void array_map_free(struct bpf_map *map) { struct bpf_array *array = container_of(map, struct bpf_array, map); int i; if (!IS_ERR_OR_NULL(map->record)) { if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { for (i = 0; i < array->map.max_entries; i++) { void __percpu *pptr = array->pptrs[i & array->index_mask]; int cpu; for_each_possible_cpu(cpu) { bpf_obj_free_fields(map->record, per_cpu_ptr(pptr, cpu)); cond_resched(); } } } else { for (i = 0; i < array->map.max_entries; i++) bpf_obj_free_fields(map->record, array_map_elem_ptr(array, i)); } } if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) bpf_array_free_percpu(array); if (array->map.map_flags & BPF_F_MMAPABLE) bpf_map_area_free(array_map_vmalloc_addr(array)); else bpf_map_area_free(array); } static void array_map_seq_show_elem(struct bpf_map *map, void *key, struct seq_file *m) { void *value; rcu_read_lock(); value = array_map_lookup_elem(map, key); if (!value) { rcu_read_unlock(); return; } if (map->btf_key_type_id) seq_printf(m, "%u: ", *(u32 *)key); btf_type_seq_show(map->btf, map->btf_value_type_id, value, m); seq_putc(m, '\n'); rcu_read_unlock(); } static void percpu_array_map_seq_show_elem(struct bpf_map *map, void *key, struct seq_file *m) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 index = *(u32 *)key; void __percpu *pptr; int cpu; rcu_read_lock(); seq_printf(m, "%u: {\n", *(u32 *)key); pptr = array->pptrs[index & array->index_mask]; for_each_possible_cpu(cpu) { seq_printf(m, "\tcpu%d: ", cpu); btf_type_seq_show(map->btf, map->btf_value_type_id, per_cpu_ptr(pptr, cpu), m); seq_putc(m, '\n'); } seq_puts(m, "}\n"); rcu_read_unlock(); } static int array_map_check_btf(const struct bpf_map *map, const struct btf *btf, const struct btf_type *key_type, const struct btf_type *value_type) { /* One exception for keyless BTF: .bss/.data/.rodata map */ if (btf_type_is_void(key_type)) { if (map->map_type != BPF_MAP_TYPE_ARRAY || map->max_entries != 1) return -EINVAL; if (BTF_INFO_KIND(value_type->info) != BTF_KIND_DATASEC) return -EINVAL; return 0; } /* * Bpf array can only take a u32 key. This check makes sure * that the btf matches the attr used during map_create. */ if (!btf_type_is_i32(key_type)) return -EINVAL; return 0; } static int array_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) { struct bpf_array *array = container_of(map, struct bpf_array, map); pgoff_t pgoff = PAGE_ALIGN(sizeof(*array)) >> PAGE_SHIFT; if (!(map->map_flags & BPF_F_MMAPABLE)) return -EINVAL; if (vma->vm_pgoff * PAGE_SIZE + (vma->vm_end - vma->vm_start) > PAGE_ALIGN((u64)array->map.max_entries * array->elem_size)) return -EINVAL; return remap_vmalloc_range(vma, array_map_vmalloc_addr(array), vma->vm_pgoff + pgoff); } static bool array_map_meta_equal(const struct bpf_map *meta0, const struct bpf_map *meta1) { if (!bpf_map_meta_equal(meta0, meta1)) return false; return meta0->map_flags & BPF_F_INNER_MAP ? true : meta0->max_entries == meta1->max_entries; } struct bpf_iter_seq_array_map_info { struct bpf_map *map; void *percpu_value_buf; u32 index; }; static void *bpf_array_map_seq_start(struct seq_file *seq, loff_t *pos) { struct bpf_iter_seq_array_map_info *info = seq->private; struct bpf_map *map = info->map; struct bpf_array *array; u32 index; if (info->index >= map->max_entries) return NULL; if (*pos == 0) ++*pos; array = container_of(map, struct bpf_array, map); index = info->index & array->index_mask; if (info->percpu_value_buf) return (void *)(uintptr_t)array->pptrs[index]; return array_map_elem_ptr(array, index); } static void *bpf_array_map_seq_next(struct seq_file *seq, void *v, loff_t *pos) { struct bpf_iter_seq_array_map_info *info = seq->private; struct bpf_map *map = info->map; struct bpf_array *array; u32 index; ++*pos; ++info->index; if (info->index >= map->max_entries) return NULL; array = container_of(map, struct bpf_array, map); index = info->index & array->index_mask; if (info->percpu_value_buf) return (void *)(uintptr_t)array->pptrs[index]; return array_map_elem_ptr(array, index); } static int __bpf_array_map_seq_show(struct seq_file *seq, void *v) { struct bpf_iter_seq_array_map_info *info = seq->private; struct bpf_iter__bpf_map_elem ctx = {}; struct bpf_map *map = info->map; struct bpf_array *array = container_of(map, struct bpf_array, map); struct bpf_iter_meta meta; struct bpf_prog *prog; int off = 0, cpu = 0; void __percpu *pptr; u32 size; meta.seq = seq; prog = bpf_iter_get_info(&meta, v == NULL); if (!prog) return 0; ctx.meta = &meta; ctx.map = info->map; if (v) { ctx.key = &info->index; if (!info->percpu_value_buf) { ctx.value = v; } else { pptr = (void __percpu *)(uintptr_t)v; size = array->elem_size; for_each_possible_cpu(cpu) { copy_map_value_long(map, info->percpu_value_buf + off, per_cpu_ptr(pptr, cpu)); check_and_init_map_value(map, info->percpu_value_buf + off); off += size; } ctx.value = info->percpu_value_buf; } } return bpf_iter_run_prog(prog, &ctx); } static int bpf_array_map_seq_show(struct seq_file *seq, void *v) { return __bpf_array_map_seq_show(seq, v); } static void bpf_array_map_seq_stop(struct seq_file *seq, void *v) { if (!v) (void)__bpf_array_map_seq_show(seq, NULL); } static int bpf_iter_init_array_map(void *priv_data, struct bpf_iter_aux_info *aux) { struct bpf_iter_seq_array_map_info *seq_info = priv_data; struct bpf_map *map = aux->map; struct bpf_array *array = container_of(map, struct bpf_array, map); void *value_buf; u32 buf_size; if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { buf_size = array->elem_size * num_possible_cpus(); value_buf = kmalloc(buf_size, GFP_USER | __GFP_NOWARN); if (!value_buf) return -ENOMEM; seq_info->percpu_value_buf = value_buf; } /* bpf_iter_attach_map() acquires a map uref, and the uref may be * released before or in the middle of iterating map elements, so * acquire an extra map uref for iterator. */ bpf_map_inc_with_uref(map); seq_info->map = map; return 0; } static void bpf_iter_fini_array_map(void *priv_data) { struct bpf_iter_seq_array_map_info *seq_info = priv_data; bpf_map_put_with_uref(seq_info->map); kfree(seq_info->percpu_value_buf); } static const struct seq_operations bpf_array_map_seq_ops = { .start = bpf_array_map_seq_start, .next = bpf_array_map_seq_next, .stop = bpf_array_map_seq_stop, .show = bpf_array_map_seq_show, }; static const struct bpf_iter_seq_info iter_seq_info = { .seq_ops = &bpf_array_map_seq_ops, .init_seq_private = bpf_iter_init_array_map, .fini_seq_private = bpf_iter_fini_array_map, .seq_priv_size = sizeof(struct bpf_iter_seq_array_map_info), }; static long bpf_for_each_array_elem(struct bpf_map *map, bpf_callback_t callback_fn, void *callback_ctx, u64 flags) { u32 i, key, num_elems = 0; struct bpf_array *array; bool is_percpu; u64 ret = 0; void *val; cant_migrate(); if (flags != 0) return -EINVAL; is_percpu = map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; array = container_of(map, struct bpf_array, map); for (i = 0; i < map->max_entries; i++) { if (is_percpu) val = this_cpu_ptr(array->pptrs[i]); else val = array_map_elem_ptr(array, i); num_elems++; key = i; ret = callback_fn((u64)(long)map, (u64)(long)&key, (u64)(long)val, (u64)(long)callback_ctx, 0); /* return value: 0 - continue, 1 - stop and return */ if (ret) break; } return num_elems; } static u64 array_map_mem_usage(const struct bpf_map *map) { struct bpf_array *array = container_of(map, struct bpf_array, map); bool percpu = map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; u32 elem_size = array->elem_size; u64 entries = map->max_entries; u64 usage = sizeof(*array); if (percpu) { usage += entries * sizeof(void *); usage += entries * elem_size * num_possible_cpus(); } else { if (map->map_flags & BPF_F_MMAPABLE) { usage = PAGE_ALIGN(usage); usage += PAGE_ALIGN(entries * elem_size); } else { usage += entries * elem_size; } } return usage; } BTF_ID_LIST_SINGLE(array_map_btf_ids, struct, bpf_array) const struct bpf_map_ops array_map_ops = { .map_meta_equal = array_map_meta_equal, .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, .map_get_next_key = array_map_get_next_key, .map_release_uref = array_map_free_timers_wq, .map_lookup_elem = array_map_lookup_elem, .map_update_elem = array_map_update_elem, .map_delete_elem = array_map_delete_elem, .map_gen_lookup = array_map_gen_lookup, .map_direct_value_addr = array_map_direct_value_addr, .map_direct_value_meta = array_map_direct_value_meta, .map_mmap = array_map_mmap, .map_seq_show_elem = array_map_seq_show_elem, .map_check_btf = array_map_check_btf, .map_lookup_batch = generic_map_lookup_batch, .map_update_batch = generic_map_update_batch, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_array_elem, .map_mem_usage = array_map_mem_usage, .map_btf_id = &array_map_btf_ids[0], .iter_seq_info = &iter_seq_info, }; const struct bpf_map_ops percpu_array_map_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = percpu_array_map_lookup_elem, .map_gen_lookup = percpu_array_map_gen_lookup, .map_update_elem = array_map_update_elem, .map_delete_elem = array_map_delete_elem, .map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem, .map_seq_show_elem = percpu_array_map_seq_show_elem, .map_check_btf = array_map_check_btf, .map_lookup_batch = generic_map_lookup_batch, .map_update_batch = generic_map_update_batch, .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_array_elem, .map_mem_usage = array_map_mem_usage, .map_btf_id = &array_map_btf_ids[0], .iter_seq_info = &iter_seq_info, }; static int fd_array_map_alloc_check(union bpf_attr *attr) { /* only file descriptors can be stored in this type of map */ if (attr->value_size != sizeof(u32)) return -EINVAL; /* Program read-only/write-only not supported for special maps yet. */ if (attr->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) return -EINVAL; return array_map_alloc_check(attr); } static void fd_array_map_free(struct bpf_map *map) { struct bpf_array *array = container_of(map, struct bpf_array, map); int i; /* make sure it's empty */ for (i = 0; i < array->map.max_entries; i++) BUG_ON(array->ptrs[i] != NULL); bpf_map_area_free(array); } static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key) { return ERR_PTR(-EOPNOTSUPP); } /* only called from syscall */ int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value) { void **elem, *ptr; int ret = 0; if (!map->ops->map_fd_sys_lookup_elem) return -ENOTSUPP; rcu_read_lock(); elem = array_map_lookup_elem(map, key); if (elem && (ptr = READ_ONCE(*elem))) *value = map->ops->map_fd_sys_lookup_elem(ptr); else ret = -ENOENT; rcu_read_unlock(); return ret; } /* only called from syscall */ int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file, void *key, void *value, u64 map_flags) { struct bpf_array *array = container_of(map, struct bpf_array, map); void *new_ptr, *old_ptr; u32 index = *(u32 *)key, ufd; if (map_flags != BPF_ANY) return -EINVAL; if (index >= array->map.max_entries) return -E2BIG; ufd = *(u32 *)value; new_ptr = map->ops->map_fd_get_ptr(map, map_file, ufd); if (IS_ERR(new_ptr)) return PTR_ERR(new_ptr); if (map->ops->map_poke_run) { mutex_lock(&array->aux->poke_mutex); old_ptr = xchg(array->ptrs + index, new_ptr); map->ops->map_poke_run(map, index, old_ptr, new_ptr); mutex_unlock(&array->aux->poke_mutex); } else { old_ptr = xchg(array->ptrs + index, new_ptr); } if (old_ptr) map->ops->map_fd_put_ptr(map, old_ptr, true); return 0; } static long __fd_array_map_delete_elem(struct bpf_map *map, void *key, bool need_defer) { struct bpf_array *array = container_of(map, struct bpf_array, map); void *old_ptr; u32 index = *(u32 *)key; if (index >= array->map.max_entries) return -E2BIG; if (map->ops->map_poke_run) { mutex_lock(&array->aux->poke_mutex); old_ptr = xchg(array->ptrs + index, NULL); map->ops->map_poke_run(map, index, old_ptr, NULL); mutex_unlock(&array->aux->poke_mutex); } else { old_ptr = xchg(array->ptrs + index, NULL); } if (old_ptr) { map->ops->map_fd_put_ptr(map, old_ptr, need_defer); return 0; } else { return -ENOENT; } } static long fd_array_map_delete_elem(struct bpf_map *map, void *key) { return __fd_array_map_delete_elem(map, key, true); } static void *prog_fd_array_get_ptr(struct bpf_map *map, struct file *map_file, int fd) { struct bpf_prog *prog = bpf_prog_get(fd); bool is_extended; if (IS_ERR(prog)) return prog; if (prog->type == BPF_PROG_TYPE_EXT || !bpf_prog_map_compatible(map, prog)) { bpf_prog_put(prog); return ERR_PTR(-EINVAL); } mutex_lock(&prog->aux->ext_mutex); is_extended = prog->aux->is_extended; if (!is_extended) prog->aux->prog_array_member_cnt++; mutex_unlock(&prog->aux->ext_mutex); if (is_extended) { /* Extended prog can not be tail callee. It's to prevent a * potential infinite loop like: * tail callee prog entry -> tail callee prog subprog -> * freplace prog entry --tailcall-> tail callee prog entry. */ bpf_prog_put(prog); return ERR_PTR(-EBUSY); } return prog; } static void prog_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer) { struct bpf_prog *prog = ptr; mutex_lock(&prog->aux->ext_mutex); prog->aux->prog_array_member_cnt--; mutex_unlock(&prog->aux->ext_mutex); /* bpf_prog is freed after one RCU or tasks trace grace period */ bpf_prog_put(prog); } static u32 prog_fd_array_sys_lookup_elem(void *ptr) { return ((struct bpf_prog *)ptr)->aux->id; } /* decrement refcnt of all bpf_progs that are stored in this map */ static void bpf_fd_array_map_clear(struct bpf_map *map, bool need_defer) { struct bpf_array *array = container_of(map, struct bpf_array, map); int i; for (i = 0; i < array->map.max_entries; i++) __fd_array_map_delete_elem(map, &i, need_defer); } static void prog_array_map_seq_show_elem(struct bpf_map *map, void *key, struct seq_file *m) { void **elem, *ptr; u32 prog_id; rcu_read_lock(); elem = array_map_lookup_elem(map, key); if (elem) { ptr = READ_ONCE(*elem); if (ptr) { seq_printf(m, "%u: ", *(u32 *)key); prog_id = prog_fd_array_sys_lookup_elem(ptr); btf_type_seq_show(map->btf, map->btf_value_type_id, &prog_id, m); seq_putc(m, '\n'); } } rcu_read_unlock(); } struct prog_poke_elem { struct list_head list; struct bpf_prog_aux *aux; }; static int prog_array_map_poke_track(struct bpf_map *map, struct bpf_prog_aux *prog_aux) { struct prog_poke_elem *elem; struct bpf_array_aux *aux; int ret = 0; aux = container_of(map, struct bpf_array, map)->aux; mutex_lock(&aux->poke_mutex); list_for_each_entry(elem, &aux->poke_progs, list) { if (elem->aux == prog_aux) goto out; } elem = kmalloc(sizeof(*elem), GFP_KERNEL); if (!elem) { ret = -ENOMEM; goto out; } INIT_LIST_HEAD(&elem->list); /* We must track the program's aux info at this point in time * since the program pointer itself may not be stable yet, see * also comment in prog_array_map_poke_run(). */ elem->aux = prog_aux; list_add_tail(&elem->list, &aux->poke_progs); out: mutex_unlock(&aux->poke_mutex); return ret; } static void prog_array_map_poke_untrack(struct bpf_map *map, struct bpf_prog_aux *prog_aux) { struct prog_poke_elem *elem, *tmp; struct bpf_array_aux *aux; aux = container_of(map, struct bpf_array, map)->aux; mutex_lock(&aux->poke_mutex); list_for_each_entry_safe(elem, tmp, &aux->poke_progs, list) { if (elem->aux == prog_aux) { list_del_init(&elem->list); kfree(elem); break; } } mutex_unlock(&aux->poke_mutex); } void __weak bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke, struct bpf_prog *new, struct bpf_prog *old) { WARN_ON_ONCE(1); } static void prog_array_map_poke_run(struct bpf_map *map, u32 key, struct bpf_prog *old, struct bpf_prog *new) { struct prog_poke_elem *elem; struct bpf_array_aux *aux; aux = container_of(map, struct bpf_array, map)->aux; WARN_ON_ONCE(!mutex_is_locked(&aux->poke_mutex)); list_for_each_entry(elem, &aux->poke_progs, list) { struct bpf_jit_poke_descriptor *poke; int i; for (i = 0; i < elem->aux->size_poke_tab; i++) { poke = &elem->aux->poke_tab[i]; /* Few things to be aware of: * * 1) We can only ever access aux in this context, but * not aux->prog since it might not be stable yet and * there could be danger of use after free otherwise. * 2) Initially when we start tracking aux, the program * is not JITed yet and also does not have a kallsyms * entry. We skip these as poke->tailcall_target_stable * is not active yet. The JIT will do the final fixup * before setting it stable. The various * poke->tailcall_target_stable are successively * activated, so tail call updates can arrive from here * while JIT is still finishing its final fixup for * non-activated poke entries. * 3) Also programs reaching refcount of zero while patching * is in progress is okay since we're protected under * poke_mutex and untrack the programs before the JIT * buffer is freed. */ if (!READ_ONCE(poke->tailcall_target_stable)) continue; if (poke->reason != BPF_POKE_REASON_TAIL_CALL) continue; if (poke->tail_call.map != map || poke->tail_call.key != key) continue; bpf_arch_poke_desc_update(poke, new, old); } } } static void prog_array_map_clear_deferred(struct work_struct *work) { struct bpf_map *map = container_of(work, struct bpf_array_aux, work)->map; bpf_fd_array_map_clear(map, true); bpf_map_put(map); } static void prog_array_map_clear(struct bpf_map *map) { struct bpf_array_aux *aux = container_of(map, struct bpf_array, map)->aux; bpf_map_inc(map); schedule_work(&aux->work); } static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr) { struct bpf_array_aux *aux; struct bpf_map *map; aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT); if (!aux) return ERR_PTR(-ENOMEM); INIT_WORK(&aux->work, prog_array_map_clear_deferred); INIT_LIST_HEAD(&aux->poke_progs); mutex_init(&aux->poke_mutex); map = array_map_alloc(attr); if (IS_ERR(map)) { kfree(aux); return map; } container_of(map, struct bpf_array, map)->aux = aux; aux->map = map; return map; } static void prog_array_map_free(struct bpf_map *map) { struct prog_poke_elem *elem, *tmp; struct bpf_array_aux *aux; aux = container_of(map, struct bpf_array, map)->aux; list_for_each_entry_safe(elem, tmp, &aux->poke_progs, list) { list_del_init(&elem->list); kfree(elem); } kfree(aux); fd_array_map_free(map); } /* prog_array->aux->{type,jited} is a runtime binding. * Doing static check alone in the verifier is not enough. * Thus, prog_array_map cannot be used as an inner_map * and map_meta_equal is not implemented. */ const struct bpf_map_ops prog_array_map_ops = { .map_alloc_check = fd_array_map_alloc_check, .map_alloc = prog_array_map_alloc, .map_free = prog_array_map_free, .map_poke_track = prog_array_map_poke_track, .map_poke_untrack = prog_array_map_poke_untrack, .map_poke_run = prog_array_map_poke_run, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, .map_delete_elem = fd_array_map_delete_elem, .map_fd_get_ptr = prog_fd_array_get_ptr, .map_fd_put_ptr = prog_fd_array_put_ptr, .map_fd_sys_lookup_elem = prog_fd_array_sys_lookup_elem, .map_release_uref = prog_array_map_clear, .map_seq_show_elem = prog_array_map_seq_show_elem, .map_mem_usage = array_map_mem_usage, .map_btf_id = &array_map_btf_ids[0], }; static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file, struct file *map_file) { struct bpf_event_entry *ee; ee = kzalloc(sizeof(*ee), GFP_KERNEL); if (ee) { ee->event = perf_file->private_data; ee->perf_file = perf_file; ee->map_file = map_file; } return ee; } static void __bpf_event_entry_free(struct rcu_head *rcu) { struct bpf_event_entry *ee; ee = container_of(rcu, struct bpf_event_entry, rcu); fput(ee->perf_file); kfree(ee); } static void bpf_event_entry_free_rcu(struct bpf_event_entry *ee) { call_rcu(&ee->rcu, __bpf_event_entry_free); } static void *perf_event_fd_array_get_ptr(struct bpf_map *map, struct file *map_file, int fd) { struct bpf_event_entry *ee; struct perf_event *event; struct file *perf_file; u64 value; perf_file = perf_event_get(fd); if (IS_ERR(perf_file)) return perf_file; ee = ERR_PTR(-EOPNOTSUPP); event = perf_file->private_data; if (perf_event_read_local(event, &value, NULL, NULL) == -EOPNOTSUPP) goto err_out; ee = bpf_event_entry_gen(perf_file, map_file); if (ee) return ee; ee = ERR_PTR(-ENOMEM); err_out: fput(perf_file); return ee; } static void perf_event_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer) { /* bpf_perf_event is freed after one RCU grace period */ bpf_event_entry_free_rcu(ptr); } static void perf_event_fd_array_release(struct bpf_map *map, struct file *map_file) { struct bpf_array *array = container_of(map, struct bpf_array, map); struct bpf_event_entry *ee; int i; if (map->map_flags & BPF_F_PRESERVE_ELEMS) return; rcu_read_lock(); for (i = 0; i < array->map.max_entries; i++) { ee = READ_ONCE(array->ptrs[i]); if (ee && ee->map_file == map_file) __fd_array_map_delete_elem(map, &i, true); } rcu_read_unlock(); } static void perf_event_fd_array_map_free(struct bpf_map *map) { if (map->map_flags & BPF_F_PRESERVE_ELEMS) bpf_fd_array_map_clear(map, false); fd_array_map_free(map); } const struct bpf_map_ops perf_event_array_map_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = fd_array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = perf_event_fd_array_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, .map_delete_elem = fd_array_map_delete_elem, .map_fd_get_ptr = perf_event_fd_array_get_ptr, .map_fd_put_ptr = perf_event_fd_array_put_ptr, .map_release = perf_event_fd_array_release, .map_check_btf = map_check_no_btf, .map_mem_usage = array_map_mem_usage, .map_btf_id = &array_map_btf_ids[0], }; #ifdef CONFIG_CGROUPS static void *cgroup_fd_array_get_ptr(struct bpf_map *map, struct file *map_file /* not used */, int fd) { return cgroup_get_from_fd(fd); } static void cgroup_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer) { /* cgroup_put free cgrp after a rcu grace period */ cgroup_put(ptr); } static void cgroup_fd_array_free(struct bpf_map *map) { bpf_fd_array_map_clear(map, false); fd_array_map_free(map); } const struct bpf_map_ops cgroup_array_map_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = fd_array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = cgroup_fd_array_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, .map_delete_elem = fd_array_map_delete_elem, .map_fd_get_ptr = cgroup_fd_array_get_ptr, .map_fd_put_ptr = cgroup_fd_array_put_ptr, .map_check_btf = map_check_no_btf, .map_mem_usage = array_map_mem_usage, .map_btf_id = &array_map_btf_ids[0], }; #endif static struct bpf_map *array_of_map_alloc(union bpf_attr *attr) { struct bpf_map *map, *inner_map_meta; inner_map_meta = bpf_map_meta_alloc(attr->inner_map_fd); if (IS_ERR(inner_map_meta)) return inner_map_meta; map = array_map_alloc(attr); if (IS_ERR(map)) { bpf_map_meta_free(inner_map_meta); return map; } map->inner_map_meta = inner_map_meta; return map; } static void array_of_map_free(struct bpf_map *map) { /* map->inner_map_meta is only accessed by syscall which * is protected by fdget/fdput. */ bpf_map_meta_free(map->inner_map_meta); bpf_fd_array_map_clear(map, false); fd_array_map_free(map); } static void *array_of_map_lookup_elem(struct bpf_map *map, void *key) { struct bpf_map **inner_map = array_map_lookup_elem(map, key); if (!inner_map) return NULL; return READ_ONCE(*inner_map); } static int array_of_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) { struct bpf_array *array = container_of(map, struct bpf_array, map); u32 elem_size = array->elem_size; struct bpf_insn *insn = insn_buf; const int ret = BPF_REG_0; const int map_ptr = BPF_REG_1; const int index = BPF_REG_2; *insn++ = BPF_ALU64_IMM(BPF_ADD, map_ptr, offsetof(struct bpf_array, value)); *insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0); if (!map->bypass_spec_v1) { *insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 6); *insn++ = BPF_ALU32_IMM(BPF_AND, ret, array->index_mask); } else { *insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 5); } if (is_power_of_2(elem_size)) *insn++ = BPF_ALU64_IMM(BPF_LSH, ret, ilog2(elem_size)); else *insn++ = BPF_ALU64_IMM(BPF_MUL, ret, elem_size); *insn++ = BPF_ALU64_REG(BPF_ADD, ret, map_ptr); *insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0); *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1); *insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1); *insn++ = BPF_MOV64_IMM(ret, 0); return insn - insn_buf; } const struct bpf_map_ops array_of_maps_map_ops = { .map_alloc_check = fd_array_map_alloc_check, .map_alloc = array_of_map_alloc, .map_free = array_of_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = array_of_map_lookup_elem, .map_delete_elem = fd_array_map_delete_elem, .map_fd_get_ptr = bpf_map_fd_get_ptr, .map_fd_put_ptr = bpf_map_fd_put_ptr, .map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem, .map_gen_lookup = array_of_map_gen_lookup, .map_lookup_batch = generic_map_lookup_batch, .map_update_batch = generic_map_update_batch, .map_check_btf = map_check_no_btf, .map_mem_usage = array_map_mem_usage, .map_btf_id = &array_map_btf_ids[0], }; |
| 1 7 24 63 3 97 1 91 17 1 14 3 2 6 108 108 93 15 97 61 36 46 16 30 8 2 2 2 8 89 95 5 59 46 106 105 94 11 97 82 24 6 18 18 25 37 37 6 6 6 6 5 1 2 2 1 1 1 15 1 2 3 2 2 5 11 8 3 121 2 2 1 1 90 25 68 117 1 2 1 113 77 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 | // SPDX-License-Identifier: GPL-2.0-only /* * (C) 2012-2013 by Pablo Neira Ayuso <pablo@netfilter.org> * * This software has been sponsored by Sophos Astaro <http://www.sophos.com> */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nfnetlink.h> #include <linux/netfilter/nf_tables.h> #include <linux/netfilter/nf_tables_compat.h> #include <linux/netfilter/x_tables.h> #include <linux/netfilter_ipv4/ip_tables.h> #include <linux/netfilter_ipv6/ip6_tables.h> #include <linux/netfilter_bridge/ebtables.h> #include <linux/netfilter_arp/arp_tables.h> #include <net/netfilter/nf_tables.h> #include <net/netfilter/nf_log.h> /* Used for matches where *info is larger than X byte */ #define NFT_MATCH_LARGE_THRESH 192 struct nft_xt_match_priv { void *info; }; static int nft_compat_chain_validate_dependency(const struct nft_ctx *ctx, const char *tablename) { enum nft_chain_types type = NFT_CHAIN_T_DEFAULT; const struct nft_chain *chain = ctx->chain; const struct nft_base_chain *basechain; if (!tablename || !nft_is_base_chain(chain)) return 0; basechain = nft_base_chain(chain); if (strcmp(tablename, "nat") == 0) { if (ctx->family != NFPROTO_BRIDGE) type = NFT_CHAIN_T_NAT; if (basechain->type->type != type) return -EINVAL; } return 0; } union nft_entry { struct ipt_entry e4; struct ip6t_entry e6; struct ebt_entry ebt; struct arpt_entry arp; }; static inline void nft_compat_set_par(struct xt_action_param *par, const struct nft_pktinfo *pkt, const void *xt, const void *xt_info) { par->state = pkt->state; par->thoff = nft_thoff(pkt); par->fragoff = pkt->fragoff; par->target = xt; par->targinfo = xt_info; par->hotdrop = false; } static void nft_target_eval_xt(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) { void *info = nft_expr_priv(expr); struct xt_target *target = expr->ops->data; struct sk_buff *skb = pkt->skb; struct xt_action_param xt; int ret; nft_compat_set_par(&xt, pkt, target, info); ret = target->target(skb, &xt); if (xt.hotdrop) ret = NF_DROP; switch (ret) { case XT_CONTINUE: regs->verdict.code = NFT_CONTINUE; break; default: regs->verdict.code = ret; break; } } static void nft_target_eval_bridge(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) { void *info = nft_expr_priv(expr); struct xt_target *target = expr->ops->data; struct sk_buff *skb = pkt->skb; struct xt_action_param xt; int ret; nft_compat_set_par(&xt, pkt, target, info); ret = target->target(skb, &xt); if (xt.hotdrop) ret = NF_DROP; switch (ret) { case EBT_ACCEPT: regs->verdict.code = NF_ACCEPT; break; case EBT_DROP: regs->verdict.code = NF_DROP; break; case EBT_CONTINUE: regs->verdict.code = NFT_CONTINUE; break; case EBT_RETURN: regs->verdict.code = NFT_RETURN; break; default: regs->verdict.code = ret; break; } } static const struct nla_policy nft_target_policy[NFTA_TARGET_MAX + 1] = { [NFTA_TARGET_NAME] = { .type = NLA_NUL_STRING }, [NFTA_TARGET_REV] = NLA_POLICY_MAX(NLA_BE32, 255), [NFTA_TARGET_INFO] = { .type = NLA_BINARY }, }; static void nft_target_set_tgchk_param(struct xt_tgchk_param *par, const struct nft_ctx *ctx, struct xt_target *target, void *info, union nft_entry *entry, u16 proto, bool inv) { par->net = ctx->net; par->table = ctx->table->name; switch (ctx->family) { case AF_INET: entry->e4.ip.proto = proto; entry->e4.ip.invflags = inv ? IPT_INV_PROTO : 0; break; case AF_INET6: if (proto) entry->e6.ipv6.flags |= IP6T_F_PROTO; entry->e6.ipv6.proto = proto; entry->e6.ipv6.invflags = inv ? IP6T_INV_PROTO : 0; break; case NFPROTO_BRIDGE: entry->ebt.ethproto = (__force __be16)proto; entry->ebt.invflags = inv ? EBT_IPROTO : 0; break; case NFPROTO_ARP: break; } par->entryinfo = entry; par->target = target; par->targinfo = info; if (nft_is_base_chain(ctx->chain)) { const struct nft_base_chain *basechain = nft_base_chain(ctx->chain); const struct nf_hook_ops *ops = &basechain->ops; par->hook_mask = 1 << ops->hooknum; } else { par->hook_mask = 0; } par->family = ctx->family; par->nft_compat = true; } static void target_compat_from_user(struct xt_target *t, void *in, void *out) { int pad; memcpy(out, in, t->targetsize); pad = XT_ALIGN(t->targetsize) - t->targetsize; if (pad > 0) memset(out + t->targetsize, 0, pad); } static const struct nla_policy nft_rule_compat_policy[NFTA_RULE_COMPAT_MAX + 1] = { [NFTA_RULE_COMPAT_PROTO] = { .type = NLA_U32 }, [NFTA_RULE_COMPAT_FLAGS] = { .type = NLA_U32 }, }; static int nft_parse_compat(const struct nlattr *attr, u16 *proto, bool *inv) { struct nlattr *tb[NFTA_RULE_COMPAT_MAX+1]; u32 l4proto; u32 flags; int err; err = nla_parse_nested_deprecated(tb, NFTA_RULE_COMPAT_MAX, attr, nft_rule_compat_policy, NULL); if (err < 0) return err; if (!tb[NFTA_RULE_COMPAT_PROTO] || !tb[NFTA_RULE_COMPAT_FLAGS]) return -EINVAL; flags = ntohl(nla_get_be32(tb[NFTA_RULE_COMPAT_FLAGS])); if (flags & NFT_RULE_COMPAT_F_UNUSED || flags & ~NFT_RULE_COMPAT_F_MASK) return -EINVAL; if (flags & NFT_RULE_COMPAT_F_INV) *inv = true; l4proto = ntohl(nla_get_be32(tb[NFTA_RULE_COMPAT_PROTO])); if (l4proto > U16_MAX) return -EINVAL; *proto = l4proto; return 0; } static void nft_compat_wait_for_destructors(struct net *net) { /* xtables matches or targets can have side effects, e.g. * creation/destruction of /proc files. * The xt ->destroy functions are run asynchronously from * work queue. If we have pending invocations we thus * need to wait for those to finish. */ nf_tables_trans_destroy_flush_work(net); } static int nft_target_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { void *info = nft_expr_priv(expr); struct xt_target *target = expr->ops->data; struct xt_tgchk_param par; size_t size = XT_ALIGN(nla_len(tb[NFTA_TARGET_INFO])); u16 proto = 0; bool inv = false; union nft_entry e = {}; int ret; target_compat_from_user(target, nla_data(tb[NFTA_TARGET_INFO]), info); if (ctx->nla[NFTA_RULE_COMPAT]) { ret = nft_parse_compat(ctx->nla[NFTA_RULE_COMPAT], &proto, &inv); if (ret < 0) return ret; } nft_target_set_tgchk_param(&par, ctx, target, info, &e, proto, inv); nft_compat_wait_for_destructors(ctx->net); ret = xt_check_target(&par, size, proto, inv); if (ret < 0) { if (ret == -ENOENT) { const char *modname = NULL; if (strcmp(target->name, "LOG") == 0) modname = "nf_log_syslog"; else if (strcmp(target->name, "NFLOG") == 0) modname = "nfnetlink_log"; if (modname && nft_request_module(ctx->net, "%s", modname) == -EAGAIN) return -EAGAIN; } return ret; } /* The standard target cannot be used */ if (!target->target) return -EINVAL; return 0; } static void __nft_mt_tg_destroy(struct module *me, const struct nft_expr *expr) { module_put(me); kfree(expr->ops); } static void nft_target_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr) { struct xt_target *target = expr->ops->data; void *info = nft_expr_priv(expr); struct module *me = target->me; struct xt_tgdtor_param par; par.net = ctx->net; par.target = target; par.targinfo = info; par.family = ctx->family; if (par.target->destroy != NULL) par.target->destroy(&par); __nft_mt_tg_destroy(me, expr); } static int nft_extension_dump_info(struct sk_buff *skb, int attr, const void *info, unsigned int size, unsigned int user_size) { unsigned int info_size, aligned_size = XT_ALIGN(size); struct nlattr *nla; nla = nla_reserve(skb, attr, aligned_size); if (!nla) return -1; info_size = user_size ? : size; memcpy(nla_data(nla), info, info_size); memset(nla_data(nla) + info_size, 0, aligned_size - info_size); return 0; } static int nft_target_dump(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { const struct xt_target *target = expr->ops->data; void *info = nft_expr_priv(expr); if (nla_put_string(skb, NFTA_TARGET_NAME, target->name) || nla_put_be32(skb, NFTA_TARGET_REV, htonl(target->revision)) || nft_extension_dump_info(skb, NFTA_TARGET_INFO, info, target->targetsize, target->usersize)) goto nla_put_failure; return 0; nla_put_failure: return -1; } static int nft_target_validate(const struct nft_ctx *ctx, const struct nft_expr *expr) { struct xt_target *target = expr->ops->data; unsigned int hook_mask = 0; int ret; if (ctx->family != NFPROTO_IPV4 && ctx->family != NFPROTO_IPV6 && ctx->family != NFPROTO_INET && ctx->family != NFPROTO_BRIDGE && ctx->family != NFPROTO_ARP) return -EOPNOTSUPP; ret = nft_chain_validate_hooks(ctx->chain, (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_IN) | (1 << NF_INET_FORWARD) | (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_POST_ROUTING)); if (ret) return ret; if (nft_is_base_chain(ctx->chain)) { const struct nft_base_chain *basechain = nft_base_chain(ctx->chain); const struct nf_hook_ops *ops = &basechain->ops; hook_mask = 1 << ops->hooknum; if (target->hooks && !(hook_mask & target->hooks)) return -EINVAL; ret = nft_compat_chain_validate_dependency(ctx, target->table); if (ret < 0) return ret; } return 0; } static void __nft_match_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt, void *info) { struct xt_match *match = expr->ops->data; struct sk_buff *skb = pkt->skb; struct xt_action_param xt; bool ret; nft_compat_set_par(&xt, pkt, match, info); ret = match->match(skb, &xt); if (xt.hotdrop) { regs->verdict.code = NF_DROP; return; } switch (ret ? 1 : 0) { case 1: regs->verdict.code = NFT_CONTINUE; break; case 0: regs->verdict.code = NFT_BREAK; break; } } static void nft_match_large_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) { struct nft_xt_match_priv *priv = nft_expr_priv(expr); __nft_match_eval(expr, regs, pkt, priv->info); } static void nft_match_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) { __nft_match_eval(expr, regs, pkt, nft_expr_priv(expr)); } static const struct nla_policy nft_match_policy[NFTA_MATCH_MAX + 1] = { [NFTA_MATCH_NAME] = { .type = NLA_NUL_STRING }, [NFTA_MATCH_REV] = NLA_POLICY_MAX(NLA_BE32, 255), [NFTA_MATCH_INFO] = { .type = NLA_BINARY }, }; /* struct xt_mtchk_param and xt_tgchk_param look very similar */ static void nft_match_set_mtchk_param(struct xt_mtchk_param *par, const struct nft_ctx *ctx, struct xt_match *match, void *info, union nft_entry *entry, u16 proto, bool inv) { par->net = ctx->net; par->table = ctx->table->name; switch (ctx->family) { case AF_INET: entry->e4.ip.proto = proto; entry->e4.ip.invflags = inv ? IPT_INV_PROTO : 0; break; case AF_INET6: if (proto) entry->e6.ipv6.flags |= IP6T_F_PROTO; entry->e6.ipv6.proto = proto; entry->e6.ipv6.invflags = inv ? IP6T_INV_PROTO : 0; break; case NFPROTO_BRIDGE: entry->ebt.ethproto = (__force __be16)proto; entry->ebt.invflags = inv ? EBT_IPROTO : 0; break; case NFPROTO_ARP: break; } par->entryinfo = entry; par->match = match; par->matchinfo = info; if (nft_is_base_chain(ctx->chain)) { const struct nft_base_chain *basechain = nft_base_chain(ctx->chain); const struct nf_hook_ops *ops = &basechain->ops; par->hook_mask = 1 << ops->hooknum; } else { par->hook_mask = 0; } par->family = ctx->family; par->nft_compat = true; } static void match_compat_from_user(struct xt_match *m, void *in, void *out) { int pad; memcpy(out, in, m->matchsize); pad = XT_ALIGN(m->matchsize) - m->matchsize; if (pad > 0) memset(out + m->matchsize, 0, pad); } static int __nft_match_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[], void *info) { struct xt_match *match = expr->ops->data; struct xt_mtchk_param par; size_t size = XT_ALIGN(nla_len(tb[NFTA_MATCH_INFO])); u16 proto = 0; bool inv = false; union nft_entry e = {}; int ret; match_compat_from_user(match, nla_data(tb[NFTA_MATCH_INFO]), info); if (ctx->nla[NFTA_RULE_COMPAT]) { ret = nft_parse_compat(ctx->nla[NFTA_RULE_COMPAT], &proto, &inv); if (ret < 0) return ret; } nft_match_set_mtchk_param(&par, ctx, match, info, &e, proto, inv); nft_compat_wait_for_destructors(ctx->net); return xt_check_match(&par, size, proto, inv); } static int nft_match_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { return __nft_match_init(ctx, expr, tb, nft_expr_priv(expr)); } static int nft_match_large_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { struct nft_xt_match_priv *priv = nft_expr_priv(expr); struct xt_match *m = expr->ops->data; int ret; priv->info = kmalloc(XT_ALIGN(m->matchsize), GFP_KERNEL_ACCOUNT); if (!priv->info) return -ENOMEM; ret = __nft_match_init(ctx, expr, tb, priv->info); if (ret) kfree(priv->info); return ret; } static void __nft_match_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr, void *info) { struct xt_match *match = expr->ops->data; struct module *me = match->me; struct xt_mtdtor_param par; par.net = ctx->net; par.match = match; par.matchinfo = info; par.family = ctx->family; if (par.match->destroy != NULL) par.match->destroy(&par); __nft_mt_tg_destroy(me, expr); } static void nft_match_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr) { __nft_match_destroy(ctx, expr, nft_expr_priv(expr)); } static void nft_match_large_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr) { struct nft_xt_match_priv *priv = nft_expr_priv(expr); __nft_match_destroy(ctx, expr, priv->info); kfree(priv->info); } static int __nft_match_dump(struct sk_buff *skb, const struct nft_expr *expr, void *info) { struct xt_match *match = expr->ops->data; if (nla_put_string(skb, NFTA_MATCH_NAME, match->name) || nla_put_be32(skb, NFTA_MATCH_REV, htonl(match->revision)) || nft_extension_dump_info(skb, NFTA_MATCH_INFO, info, match->matchsize, match->usersize)) goto nla_put_failure; return 0; nla_put_failure: return -1; } static int nft_match_dump(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { return __nft_match_dump(skb, expr, nft_expr_priv(expr)); } static int nft_match_large_dump(struct sk_buff *skb, const struct nft_expr *e, bool reset) { struct nft_xt_match_priv *priv = nft_expr_priv(e); return __nft_match_dump(skb, e, priv->info); } static int nft_match_validate(const struct nft_ctx *ctx, const struct nft_expr *expr) { struct xt_match *match = expr->ops->data; unsigned int hook_mask = 0; int ret; if (ctx->family != NFPROTO_IPV4 && ctx->family != NFPROTO_IPV6 && ctx->family != NFPROTO_INET && ctx->family != NFPROTO_BRIDGE && ctx->family != NFPROTO_ARP) return -EOPNOTSUPP; ret = nft_chain_validate_hooks(ctx->chain, (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_IN) | (1 << NF_INET_FORWARD) | (1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_POST_ROUTING)); if (ret) return ret; if (nft_is_base_chain(ctx->chain)) { const struct nft_base_chain *basechain = nft_base_chain(ctx->chain); const struct nf_hook_ops *ops = &basechain->ops; hook_mask = 1 << ops->hooknum; if (match->hooks && !(hook_mask & match->hooks)) return -EINVAL; ret = nft_compat_chain_validate_dependency(ctx, match->table); if (ret < 0) return ret; } return 0; } static int nfnl_compat_fill_info(struct sk_buff *skb, u32 portid, u32 seq, u32 type, int event, u16 family, const char *name, int rev, int target) { struct nlmsghdr *nlh; unsigned int flags = portid ? NLM_F_MULTI : 0; event = nfnl_msg_type(NFNL_SUBSYS_NFT_COMPAT, event); nlh = nfnl_msg_put(skb, portid, seq, event, flags, family, NFNETLINK_V0, 0); if (!nlh) goto nlmsg_failure; if (nla_put_string(skb, NFTA_COMPAT_NAME, name) || nla_put_be32(skb, NFTA_COMPAT_REV, htonl(rev)) || nla_put_be32(skb, NFTA_COMPAT_TYPE, htonl(target))) goto nla_put_failure; nlmsg_end(skb, nlh); return skb->len; nlmsg_failure: nla_put_failure: nlmsg_cancel(skb, nlh); return -1; } static int nfnl_compat_get_rcu(struct sk_buff *skb, const struct nfnl_info *info, const struct nlattr * const tb[]) { u8 family = info->nfmsg->nfgen_family; const char *name, *fmt; struct sk_buff *skb2; int ret = 0, target; u32 rev; if (tb[NFTA_COMPAT_NAME] == NULL || tb[NFTA_COMPAT_REV] == NULL || tb[NFTA_COMPAT_TYPE] == NULL) return -EINVAL; name = nla_data(tb[NFTA_COMPAT_NAME]); rev = ntohl(nla_get_be32(tb[NFTA_COMPAT_REV])); target = ntohl(nla_get_be32(tb[NFTA_COMPAT_TYPE])); switch(family) { case AF_INET: fmt = "ipt_%s"; break; case AF_INET6: fmt = "ip6t_%s"; break; case NFPROTO_BRIDGE: fmt = "ebt_%s"; break; case NFPROTO_ARP: fmt = "arpt_%s"; break; default: pr_err("nft_compat: unsupported protocol %d\n", family); return -EINVAL; } if (!try_module_get(THIS_MODULE)) return -EINVAL; rcu_read_unlock(); try_then_request_module(xt_find_revision(family, name, rev, target, &ret), fmt, name); if (ret < 0) goto out_put; skb2 = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (skb2 == NULL) { ret = -ENOMEM; goto out_put; } /* include the best revision for this extension in the message */ if (nfnl_compat_fill_info(skb2, NETLINK_CB(skb).portid, info->nlh->nlmsg_seq, NFNL_MSG_TYPE(info->nlh->nlmsg_type), NFNL_MSG_COMPAT_GET, family, name, ret, target) <= 0) { kfree_skb(skb2); goto out_put; } ret = nfnetlink_unicast(skb2, info->net, NETLINK_CB(skb).portid); out_put: rcu_read_lock(); module_put(THIS_MODULE); return ret; } static const struct nla_policy nfnl_compat_policy_get[NFTA_COMPAT_MAX+1] = { [NFTA_COMPAT_NAME] = { .type = NLA_NUL_STRING, .len = NFT_COMPAT_NAME_MAX-1 }, [NFTA_COMPAT_REV] = NLA_POLICY_MAX(NLA_BE32, 255), [NFTA_COMPAT_TYPE] = { .type = NLA_U32 }, }; static const struct nfnl_callback nfnl_nft_compat_cb[NFNL_MSG_COMPAT_MAX] = { [NFNL_MSG_COMPAT_GET] = { .call = nfnl_compat_get_rcu, .type = NFNL_CB_RCU, .attr_count = NFTA_COMPAT_MAX, .policy = nfnl_compat_policy_get }, }; static const struct nfnetlink_subsystem nfnl_compat_subsys = { .name = "nft-compat", .subsys_id = NFNL_SUBSYS_NFT_COMPAT, .cb_count = NFNL_MSG_COMPAT_MAX, .cb = nfnl_nft_compat_cb, }; static struct nft_expr_type nft_match_type; static bool nft_match_reduce(struct nft_regs_track *track, const struct nft_expr *expr) { const struct xt_match *match = expr->ops->data; return strcmp(match->name, "comment") == 0; } static const struct nft_expr_ops * nft_match_select_ops(const struct nft_ctx *ctx, const struct nlattr * const tb[]) { struct nft_expr_ops *ops; struct xt_match *match; unsigned int matchsize; char *mt_name; u32 rev, family; int err; if (tb[NFTA_MATCH_NAME] == NULL || tb[NFTA_MATCH_REV] == NULL || tb[NFTA_MATCH_INFO] == NULL) return ERR_PTR(-EINVAL); mt_name = nla_data(tb[NFTA_MATCH_NAME]); rev = ntohl(nla_get_be32(tb[NFTA_MATCH_REV])); family = ctx->family; match = xt_request_find_match(family, mt_name, rev); if (IS_ERR(match)) return ERR_PTR(-ENOENT); if (match->matchsize > nla_len(tb[NFTA_MATCH_INFO])) { err = -EINVAL; goto err; } ops = kzalloc(sizeof(struct nft_expr_ops), GFP_KERNEL_ACCOUNT); if (!ops) { err = -ENOMEM; goto err; } ops->type = &nft_match_type; ops->eval = nft_match_eval; ops->init = nft_match_init; ops->destroy = nft_match_destroy; ops->dump = nft_match_dump; ops->validate = nft_match_validate; ops->data = match; ops->reduce = nft_match_reduce; matchsize = NFT_EXPR_SIZE(XT_ALIGN(match->matchsize)); if (matchsize > NFT_MATCH_LARGE_THRESH) { matchsize = NFT_EXPR_SIZE(sizeof(struct nft_xt_match_priv)); ops->eval = nft_match_large_eval; ops->init = nft_match_large_init; ops->destroy = nft_match_large_destroy; ops->dump = nft_match_large_dump; } ops->size = matchsize; return ops; err: module_put(match->me); return ERR_PTR(err); } static void nft_match_release_ops(const struct nft_expr_ops *ops) { struct xt_match *match = ops->data; module_put(match->me); kfree(ops); } static struct nft_expr_type nft_match_type __read_mostly = { .name = "match", .select_ops = nft_match_select_ops, .release_ops = nft_match_release_ops, .policy = nft_match_policy, .maxattr = NFTA_MATCH_MAX, .owner = THIS_MODULE, }; static struct nft_expr_type nft_target_type; static const struct nft_expr_ops * nft_target_select_ops(const struct nft_ctx *ctx, const struct nlattr * const tb[]) { struct nft_expr_ops *ops; struct xt_target *target; char *tg_name; u32 rev, family; int err; if (tb[NFTA_TARGET_NAME] == NULL || tb[NFTA_TARGET_REV] == NULL || tb[NFTA_TARGET_INFO] == NULL) return ERR_PTR(-EINVAL); tg_name = nla_data(tb[NFTA_TARGET_NAME]); rev = ntohl(nla_get_be32(tb[NFTA_TARGET_REV])); family = ctx->family; if (strcmp(tg_name, XT_ERROR_TARGET) == 0 || strcmp(tg_name, XT_STANDARD_TARGET) == 0 || strcmp(tg_name, "standard") == 0) return ERR_PTR(-EINVAL); target = xt_request_find_target(family, tg_name, rev); if (IS_ERR(target)) return ERR_PTR(-ENOENT); if (!target->target) { err = -EINVAL; goto err; } if (target->targetsize > nla_len(tb[NFTA_TARGET_INFO])) { err = -EINVAL; goto err; } ops = kzalloc(sizeof(struct nft_expr_ops), GFP_KERNEL_ACCOUNT); if (!ops) { err = -ENOMEM; goto err; } ops->type = &nft_target_type; ops->size = NFT_EXPR_SIZE(XT_ALIGN(target->targetsize)); ops->init = nft_target_init; ops->destroy = nft_target_destroy; ops->dump = nft_target_dump; ops->validate = nft_target_validate; ops->data = target; ops->reduce = NFT_REDUCE_READONLY; if (family == NFPROTO_BRIDGE) ops->eval = nft_target_eval_bridge; else ops->eval = nft_target_eval_xt; return ops; err: module_put(target->me); return ERR_PTR(err); } static void nft_target_release_ops(const struct nft_expr_ops *ops) { struct xt_target *target = ops->data; module_put(target->me); kfree(ops); } static struct nft_expr_type nft_target_type __read_mostly = { .name = "target", .select_ops = nft_target_select_ops, .release_ops = nft_target_release_ops, .policy = nft_target_policy, .maxattr = NFTA_TARGET_MAX, .owner = THIS_MODULE, }; static int __init nft_compat_module_init(void) { int ret; ret = nft_register_expr(&nft_match_type); if (ret < 0) return ret; ret = nft_register_expr(&nft_target_type); if (ret < 0) goto err_match; ret = nfnetlink_subsys_register(&nfnl_compat_subsys); if (ret < 0) { pr_err("nft_compat: cannot register with nfnetlink.\n"); goto err_target; } return ret; err_target: nft_unregister_expr(&nft_target_type); err_match: nft_unregister_expr(&nft_match_type); return ret; } static void __exit nft_compat_module_exit(void) { nfnetlink_subsys_unregister(&nfnl_compat_subsys); nft_unregister_expr(&nft_target_type); nft_unregister_expr(&nft_match_type); } MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_NFT_COMPAT); module_init(nft_compat_module_init); module_exit(nft_compat_module_exit); MODULE_LICENSE("GPL"); MODULE_AUTHOR("Pablo Neira Ayuso <pablo@netfilter.org>"); MODULE_ALIAS_NFT_EXPR("match"); MODULE_ALIAS_NFT_EXPR("target"); MODULE_DESCRIPTION("x_tables over nftables support"); |
| 39 39 39 39 48 19 19 19 39 39 39 2 2 2 2 64 64 64 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 | // SPDX-License-Identifier: GPL-2.0-only #include <linux/etherdevice.h> #include <linux/if_tap.h> #include <linux/if_vlan.h> #include <linux/interrupt.h> #include <linux/nsproxy.h> #include <linux/compat.h> #include <linux/if_tun.h> #include <linux/module.h> #include <linux/skbuff.h> #include <linux/cache.h> #include <linux/sched/signal.h> #include <linux/types.h> #include <linux/slab.h> #include <linux/wait.h> #include <linux/cdev.h> #include <linux/idr.h> #include <linux/fs.h> #include <linux/uio.h> #include <net/gso.h> #include <net/net_namespace.h> #include <net/rtnetlink.h> #include <net/sock.h> #include <net/xdp.h> #include <linux/virtio_net.h> #include <linux/skb_array.h> #include "tun_vnet.h" #define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE) static struct proto tap_proto = { .name = "tap", .owner = THIS_MODULE, .obj_size = sizeof(struct tap_queue), }; #define TAP_NUM_DEVS (1U << MINORBITS) static LIST_HEAD(major_list); struct major_info { struct rcu_head rcu; dev_t major; struct idr minor_idr; spinlock_t minor_lock; const char *device_name; struct list_head next; }; #define GOODCOPY_LEN 128 static const struct proto_ops tap_socket_ops; #define RX_OFFLOADS (NETIF_F_GRO | NETIF_F_LRO) #define TAP_FEATURES (NETIF_F_GSO | NETIF_F_SG | NETIF_F_FRAGLIST) static struct tap_dev *tap_dev_get_rcu(const struct net_device *dev) { return rcu_dereference(dev->rx_handler_data); } /* * RCU usage: * The tap_queue and the macvlan_dev are loosely coupled, the * pointers from one to the other can only be read while rcu_read_lock * or rtnl is held. * * Both the file and the macvlan_dev hold a reference on the tap_queue * through sock_hold(&q->sk). When the macvlan_dev goes away first, * q->vlan becomes inaccessible. When the files gets closed, * tap_get_queue() fails. * * There may still be references to the struct sock inside of the * queue from outbound SKBs, but these never reference back to the * file or the dev. The data structure is freed through __sk_free * when both our references and any pending SKBs are gone. */ static int tap_enable_queue(struct tap_dev *tap, struct file *file, struct tap_queue *q) { int err = -EINVAL; ASSERT_RTNL(); if (q->enabled) goto out; err = 0; rcu_assign_pointer(tap->taps[tap->numvtaps], q); q->queue_index = tap->numvtaps; q->enabled = true; tap->numvtaps++; out: return err; } /* Requires RTNL */ static int tap_set_queue(struct tap_dev *tap, struct file *file, struct tap_queue *q) { if (tap->numqueues == MAX_TAP_QUEUES) return -EBUSY; rcu_assign_pointer(q->tap, tap); rcu_assign_pointer(tap->taps[tap->numvtaps], q); sock_hold(&q->sk); q->file = file; q->queue_index = tap->numvtaps; q->enabled = true; file->private_data = q; list_add_tail(&q->next, &tap->queue_list); tap->numvtaps++; tap->numqueues++; return 0; } static int tap_disable_queue(struct tap_queue *q) { struct tap_dev *tap; struct tap_queue *nq; ASSERT_RTNL(); if (!q->enabled) return -EINVAL; tap = rtnl_dereference(q->tap); if (tap) { int index = q->queue_index; BUG_ON(index >= tap->numvtaps); nq = rtnl_dereference(tap->taps[tap->numvtaps - 1]); nq->queue_index = index; rcu_assign_pointer(tap->taps[index], nq); RCU_INIT_POINTER(tap->taps[tap->numvtaps - 1], NULL); q->enabled = false; tap->numvtaps--; } return 0; } /* * The file owning the queue got closed, give up both * the reference that the files holds as well as the * one from the macvlan_dev if that still exists. * * Using the spinlock makes sure that we don't get * to the queue again after destroying it. */ static void tap_put_queue(struct tap_queue *q) { struct tap_dev *tap; rtnl_lock(); tap = rtnl_dereference(q->tap); if (tap) { if (q->enabled) BUG_ON(tap_disable_queue(q)); tap->numqueues--; RCU_INIT_POINTER(q->tap, NULL); sock_put(&q->sk); list_del_init(&q->next); } rtnl_unlock(); synchronize_rcu(); sock_put(&q->sk); } /* * Select a queue based on the rxq of the device on which this packet * arrived. If the incoming device is not mq, calculate a flow hash * to select a queue. If all fails, find the first available queue. * Cache vlan->numvtaps since it can become zero during the execution * of this function. */ static struct tap_queue *tap_get_queue(struct tap_dev *tap, struct sk_buff *skb) { struct tap_queue *queue = NULL; /* Access to taps array is protected by rcu, but access to numvtaps * isn't. Below we use it to lookup a queue, but treat it as a hint * and validate that the result isn't NULL - in case we are * racing against queue removal. */ int numvtaps = READ_ONCE(tap->numvtaps); __u32 rxq; if (!numvtaps) goto out; if (numvtaps == 1) goto single; /* Check if we can use flow to select a queue */ rxq = skb_get_hash(skb); if (rxq) { queue = rcu_dereference(tap->taps[rxq % numvtaps]); goto out; } if (likely(skb_rx_queue_recorded(skb))) { rxq = skb_get_rx_queue(skb); while (unlikely(rxq >= numvtaps)) rxq -= numvtaps; queue = rcu_dereference(tap->taps[rxq]); goto out; } single: queue = rcu_dereference(tap->taps[0]); out: return queue; } /* * The net_device is going away, give up the reference * that it holds on all queues and safely set the pointer * from the queues to NULL. */ void tap_del_queues(struct tap_dev *tap) { struct tap_queue *q, *tmp; ASSERT_RTNL(); list_for_each_entry_safe(q, tmp, &tap->queue_list, next) { list_del_init(&q->next); RCU_INIT_POINTER(q->tap, NULL); if (q->enabled) tap->numvtaps--; tap->numqueues--; sock_put(&q->sk); } BUG_ON(tap->numvtaps); BUG_ON(tap->numqueues); /* guarantee that any future tap_set_queue will fail */ tap->numvtaps = MAX_TAP_QUEUES; } EXPORT_SYMBOL_GPL(tap_del_queues); rx_handler_result_t tap_handle_frame(struct sk_buff **pskb) { struct sk_buff *skb = *pskb; struct net_device *dev = skb->dev; struct tap_dev *tap; struct tap_queue *q; netdev_features_t features = TAP_FEATURES; enum skb_drop_reason drop_reason; tap = tap_dev_get_rcu(dev); if (!tap) return RX_HANDLER_PASS; q = tap_get_queue(tap, skb); if (!q) return RX_HANDLER_PASS; skb_push(skb, ETH_HLEN); /* Apply the forward feature mask so that we perform segmentation * according to users wishes. This only works if VNET_HDR is * enabled. */ if (q->flags & IFF_VNET_HDR) features |= tap->tap_features; if (netif_needs_gso(skb, features)) { struct sk_buff *segs = __skb_gso_segment(skb, features, false); struct sk_buff *next; if (IS_ERR(segs)) { drop_reason = SKB_DROP_REASON_SKB_GSO_SEG; goto drop; } if (!segs) { if (ptr_ring_produce(&q->ring, skb)) { drop_reason = SKB_DROP_REASON_FULL_RING; goto drop; } goto wake_up; } consume_skb(skb); skb_list_walk_safe(segs, skb, next) { skb_mark_not_on_list(skb); if (ptr_ring_produce(&q->ring, skb)) { drop_reason = SKB_DROP_REASON_FULL_RING; kfree_skb_reason(skb, drop_reason); kfree_skb_list_reason(next, drop_reason); break; } } } else { /* If we receive a partial checksum and the tap side * doesn't support checksum offload, compute the checksum. * Note: it doesn't matter which checksum feature to * check, we either support them all or none. */ if (skb->ip_summed == CHECKSUM_PARTIAL && !(features & NETIF_F_CSUM_MASK) && skb_checksum_help(skb)) { drop_reason = SKB_DROP_REASON_SKB_CSUM; goto drop; } if (ptr_ring_produce(&q->ring, skb)) { drop_reason = SKB_DROP_REASON_FULL_RING; goto drop; } } wake_up: wake_up_interruptible_poll(sk_sleep(&q->sk), EPOLLIN | EPOLLRDNORM | EPOLLRDBAND); return RX_HANDLER_CONSUMED; drop: /* Count errors/drops only here, thus don't care about args. */ if (tap->count_rx_dropped) tap->count_rx_dropped(tap); kfree_skb_reason(skb, drop_reason); return RX_HANDLER_CONSUMED; } EXPORT_SYMBOL_GPL(tap_handle_frame); static struct major_info *tap_get_major(int major) { struct major_info *tap_major; list_for_each_entry_rcu(tap_major, &major_list, next) { if (tap_major->major == major) return tap_major; } return NULL; } int tap_get_minor(dev_t major, struct tap_dev *tap) { int retval = -ENOMEM; struct major_info *tap_major; rcu_read_lock(); tap_major = tap_get_major(MAJOR(major)); if (!tap_major) { retval = -EINVAL; goto unlock; } spin_lock(&tap_major->minor_lock); retval = idr_alloc(&tap_major->minor_idr, tap, 1, TAP_NUM_DEVS, GFP_ATOMIC); if (retval >= 0) { tap->minor = retval; } else if (retval == -ENOSPC) { netdev_err(tap->dev, "Too many tap devices\n"); retval = -EINVAL; } spin_unlock(&tap_major->minor_lock); unlock: rcu_read_unlock(); return retval < 0 ? retval : 0; } EXPORT_SYMBOL_GPL(tap_get_minor); void tap_free_minor(dev_t major, struct tap_dev *tap) { struct major_info *tap_major; rcu_read_lock(); tap_major = tap_get_major(MAJOR(major)); if (!tap_major) { goto unlock; } spin_lock(&tap_major->minor_lock); if (tap->minor) { idr_remove(&tap_major->minor_idr, tap->minor); tap->minor = 0; } spin_unlock(&tap_major->minor_lock); unlock: rcu_read_unlock(); } EXPORT_SYMBOL_GPL(tap_free_minor); static struct tap_dev *dev_get_by_tap_file(int major, int minor) { struct net_device *dev = NULL; struct tap_dev *tap; struct major_info *tap_major; rcu_read_lock(); tap_major = tap_get_major(major); if (!tap_major) { tap = NULL; goto unlock; } spin_lock(&tap_major->minor_lock); tap = idr_find(&tap_major->minor_idr, minor); if (tap) { dev = tap->dev; dev_hold(dev); } spin_unlock(&tap_major->minor_lock); unlock: rcu_read_unlock(); return tap; } static void tap_sock_write_space(struct sock *sk) { wait_queue_head_t *wqueue; if (!sock_writeable(sk) || !test_and_clear_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags)) return; wqueue = sk_sleep(sk); if (wqueue && waitqueue_active(wqueue)) wake_up_interruptible_poll(wqueue, EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND); } static void tap_sock_destruct(struct sock *sk) { struct tap_queue *q = container_of(sk, struct tap_queue, sk); ptr_ring_cleanup(&q->ring, __skb_array_destroy_skb); } static int tap_open(struct inode *inode, struct file *file) { struct net *net = current->nsproxy->net_ns; struct tap_dev *tap; struct tap_queue *q; int err = -ENODEV; rtnl_lock(); tap = dev_get_by_tap_file(imajor(inode), iminor(inode)); if (!tap) goto err; err = -ENOMEM; q = (struct tap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL, &tap_proto, 0); if (!q) goto err; if (ptr_ring_init(&q->ring, tap->dev->tx_queue_len, GFP_KERNEL)) { sk_free(&q->sk); goto err; } init_waitqueue_head(&q->sock.wq.wait); q->sock.type = SOCK_RAW; q->sock.state = SS_CONNECTED; q->sock.file = file; q->sock.ops = &tap_socket_ops; sock_init_data_uid(&q->sock, &q->sk, current_fsuid()); q->sk.sk_write_space = tap_sock_write_space; q->sk.sk_destruct = tap_sock_destruct; q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP; q->vnet_hdr_sz = sizeof(struct virtio_net_hdr); /* * so far only KVM virtio_net uses tap, enable zero copy between * guest kernel and host kernel when lower device supports zerocopy * * The macvlan supports zerocopy iff the lower device supports zero * copy so we don't have to look at the lower device directly. */ if ((tap->dev->features & NETIF_F_HIGHDMA) && (tap->dev->features & NETIF_F_SG)) sock_set_flag(&q->sk, SOCK_ZEROCOPY); err = tap_set_queue(tap, file, q); if (err) { /* tap_sock_destruct() will take care of freeing ptr_ring */ goto err_put; } /* tap groks IOCB_NOWAIT just fine, mark it as such */ file->f_mode |= FMODE_NOWAIT; dev_put(tap->dev); rtnl_unlock(); return err; err_put: sock_put(&q->sk); err: if (tap) dev_put(tap->dev); rtnl_unlock(); return err; } static int tap_release(struct inode *inode, struct file *file) { struct tap_queue *q = file->private_data; tap_put_queue(q); return 0; } static __poll_t tap_poll(struct file *file, poll_table *wait) { struct tap_queue *q = file->private_data; __poll_t mask = EPOLLERR; if (!q) goto out; mask = 0; poll_wait(file, &q->sock.wq.wait, wait); if (!ptr_ring_empty(&q->ring)) mask |= EPOLLIN | EPOLLRDNORM; if (sock_writeable(&q->sk) || (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &q->sock.flags) && sock_writeable(&q->sk))) mask |= EPOLLOUT | EPOLLWRNORM; out: return mask; } static inline struct sk_buff *tap_alloc_skb(struct sock *sk, size_t prepad, size_t len, size_t linear, int noblock, int *err) { struct sk_buff *skb; /* Under a page? Don't bother with paged skb. */ if (prepad + len < PAGE_SIZE || !linear) linear = len; if (len - linear > MAX_SKB_FRAGS * (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) linear = len - MAX_SKB_FRAGS * (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER); skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock, err, PAGE_ALLOC_COSTLY_ORDER); if (!skb) return NULL; skb_reserve(skb, prepad); skb_put(skb, linear); skb->data_len = len - linear; skb->len += len - linear; return skb; } /* Neighbour code has some assumptions on HH_DATA_MOD alignment */ #define TAP_RESERVE HH_DATA_OFF(ETH_HLEN) /* Get packet from user space buffer */ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control, struct iov_iter *from, int noblock) { int good_linear = SKB_MAX_HEAD(TAP_RESERVE); struct sk_buff *skb; struct tap_dev *tap; unsigned long total_len = iov_iter_count(from); unsigned long len = total_len; int err; struct virtio_net_hdr vnet_hdr = { 0 }; int vnet_hdr_len = 0; int hdr_len = 0; int copylen = 0; int depth; bool zerocopy = false; size_t linear; enum skb_drop_reason drop_reason; if (q->flags & IFF_VNET_HDR) { vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz); hdr_len = tun_vnet_hdr_get(vnet_hdr_len, q->flags, from, &vnet_hdr); if (hdr_len < 0) { err = hdr_len; goto err; } len -= vnet_hdr_len; } err = -EINVAL; if (unlikely(len < ETH_HLEN)) goto err; if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) { struct iov_iter i; copylen = clamp(hdr_len ?: GOODCOPY_LEN, ETH_HLEN, good_linear); linear = copylen; i = *from; iov_iter_advance(&i, copylen); if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS) zerocopy = true; } if (!zerocopy) { copylen = len; linear = clamp(hdr_len, ETH_HLEN, good_linear); } skb = tap_alloc_skb(&q->sk, TAP_RESERVE, copylen, linear, noblock, &err); if (!skb) goto err; if (zerocopy) err = zerocopy_sg_from_iter(skb, from); else err = skb_copy_datagram_from_iter(skb, 0, from, len); if (err) { drop_reason = SKB_DROP_REASON_SKB_UCOPY_FAULT; goto err_kfree; } skb_set_network_header(skb, ETH_HLEN); skb_reset_mac_header(skb); skb->protocol = eth_hdr(skb)->h_proto; rcu_read_lock(); tap = rcu_dereference(q->tap); if (!tap) { kfree_skb(skb); rcu_read_unlock(); return total_len; } skb->dev = tap->dev; if (vnet_hdr_len) { err = tun_vnet_hdr_to_skb(q->flags, skb, &vnet_hdr); if (err) { rcu_read_unlock(); drop_reason = SKB_DROP_REASON_DEV_HDR; goto err_kfree; } } skb_probe_transport_header(skb); /* Move network header to the right position for VLAN tagged packets */ if (eth_type_vlan(skb->protocol) && vlan_get_protocol_and_depth(skb, skb->protocol, &depth) != 0) skb_set_network_header(skb, depth); /* copy skb_ubuf_info for callback when skb has no error */ if (zerocopy) { skb_zcopy_init(skb, msg_control); } else if (msg_control) { struct ubuf_info *uarg = msg_control; uarg->ops->complete(NULL, uarg, false); } dev_queue_xmit(skb); rcu_read_unlock(); return total_len; err_kfree: kfree_skb_reason(skb, drop_reason); err: rcu_read_lock(); tap = rcu_dereference(q->tap); if (tap && tap->count_tx_dropped) tap->count_tx_dropped(tap); rcu_read_unlock(); return err; } static ssize_t tap_write_iter(struct kiocb *iocb, struct iov_iter *from) { struct file *file = iocb->ki_filp; struct tap_queue *q = file->private_data; int noblock = 0; if ((file->f_flags & O_NONBLOCK) || (iocb->ki_flags & IOCB_NOWAIT)) noblock = 1; return tap_get_user(q, NULL, from, noblock); } /* Put packet to the user space buffer */ static ssize_t tap_put_user(struct tap_queue *q, const struct sk_buff *skb, struct iov_iter *iter) { int ret; int vnet_hdr_len = 0; int vlan_offset = 0; int total; if (q->flags & IFF_VNET_HDR) { struct virtio_net_hdr vnet_hdr; vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz); ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr); if (ret) return ret; ret = tun_vnet_hdr_put(vnet_hdr_len, iter, &vnet_hdr); if (ret) return ret; } total = vnet_hdr_len; total += skb->len; if (skb_vlan_tag_present(skb)) { struct { __be16 h_vlan_proto; __be16 h_vlan_TCI; } veth; veth.h_vlan_proto = skb->vlan_proto; veth.h_vlan_TCI = htons(skb_vlan_tag_get(skb)); vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto); total += VLAN_HLEN; ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset); if (ret || !iov_iter_count(iter)) goto done; ret = copy_to_iter(&veth, sizeof(veth), iter); if (ret != sizeof(veth) || !iov_iter_count(iter)) goto done; } ret = skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset); done: return ret ? ret : total; } static ssize_t tap_do_read(struct tap_queue *q, struct iov_iter *to, int noblock, struct sk_buff *skb) { DEFINE_WAIT(wait); ssize_t ret = 0; if (!iov_iter_count(to)) { kfree_skb(skb); return 0; } if (skb) goto put; while (1) { if (!noblock) prepare_to_wait(sk_sleep(&q->sk), &wait, TASK_INTERRUPTIBLE); /* Read frames from the queue */ skb = ptr_ring_consume(&q->ring); if (skb) break; if (noblock) { ret = -EAGAIN; break; } if (signal_pending(current)) { ret = -ERESTARTSYS; break; } /* Nothing to read, let's sleep */ schedule(); } if (!noblock) finish_wait(sk_sleep(&q->sk), &wait); put: if (skb) { ret = tap_put_user(q, skb, to); if (unlikely(ret < 0)) kfree_skb(skb); else consume_skb(skb); } return ret; } static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; struct tap_queue *q = file->private_data; ssize_t len = iov_iter_count(to), ret; int noblock = 0; if ((file->f_flags & O_NONBLOCK) || (iocb->ki_flags & IOCB_NOWAIT)) noblock = 1; ret = tap_do_read(q, to, noblock, NULL); ret = min_t(ssize_t, ret, len); if (ret > 0) iocb->ki_pos = ret; return ret; } static struct tap_dev *tap_get_tap_dev(struct tap_queue *q) { struct tap_dev *tap; ASSERT_RTNL(); tap = rtnl_dereference(q->tap); if (tap) dev_hold(tap->dev); return tap; } static void tap_put_tap_dev(struct tap_dev *tap) { dev_put(tap->dev); } static int tap_ioctl_set_queue(struct file *file, unsigned int flags) { struct tap_queue *q = file->private_data; struct tap_dev *tap; int ret; tap = tap_get_tap_dev(q); if (!tap) return -EINVAL; if (flags & IFF_ATTACH_QUEUE) ret = tap_enable_queue(tap, file, q); else if (flags & IFF_DETACH_QUEUE) ret = tap_disable_queue(q); else ret = -EINVAL; tap_put_tap_dev(tap); return ret; } static int set_offload(struct tap_queue *q, unsigned long arg) { struct tap_dev *tap; netdev_features_t features; netdev_features_t feature_mask = 0; tap = rtnl_dereference(q->tap); if (!tap) return -ENOLINK; features = tap->dev->features; if (arg & TUN_F_CSUM) { feature_mask = NETIF_F_HW_CSUM; if (arg & (TUN_F_TSO4 | TUN_F_TSO6)) { if (arg & TUN_F_TSO_ECN) feature_mask |= NETIF_F_TSO_ECN; if (arg & TUN_F_TSO4) feature_mask |= NETIF_F_TSO; if (arg & TUN_F_TSO6) feature_mask |= NETIF_F_TSO6; } /* TODO: for now USO4 and USO6 should work simultaneously */ if ((arg & (TUN_F_USO4 | TUN_F_USO6)) == (TUN_F_USO4 | TUN_F_USO6)) features |= NETIF_F_GSO_UDP_L4; } /* tun/tap driver inverts the usage for TSO offloads, where * setting the TSO bit means that the userspace wants to * accept TSO frames and turning it off means that user space * does not support TSO. * For tap, we have to invert it to mean the same thing. * When user space turns off TSO, we turn off GSO/LRO so that * user-space will not receive TSO frames. */ if (feature_mask & (NETIF_F_TSO | NETIF_F_TSO6) || (feature_mask & (TUN_F_USO4 | TUN_F_USO6)) == (TUN_F_USO4 | TUN_F_USO6)) features |= RX_OFFLOADS; else features &= ~RX_OFFLOADS; /* tap_features are the same as features on tun/tap and * reflect user expectations. */ tap->tap_features = feature_mask; if (tap->update_features) tap->update_features(tap, features); return 0; } /* * provide compatibility with generic tun/tap interface */ static long tap_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct tap_queue *q = file->private_data; struct tap_dev *tap; void __user *argp = (void __user *)arg; struct ifreq __user *ifr = argp; unsigned int __user *up = argp; unsigned short u; int __user *sp = argp; struct sockaddr_storage ss; int s; int ret; switch (cmd) { case TUNSETIFF: /* ignore the name, just look at flags */ if (get_user(u, &ifr->ifr_flags)) return -EFAULT; ret = 0; if ((u & ~TAP_IFFEATURES) != (IFF_NO_PI | IFF_TAP)) ret = -EINVAL; else q->flags = (q->flags & ~TAP_IFFEATURES) | u; return ret; case TUNGETIFF: rtnl_lock(); tap = tap_get_tap_dev(q); if (!tap) { rtnl_unlock(); return -ENOLINK; } ret = 0; u = q->flags; if (copy_to_user(&ifr->ifr_name, tap->dev->name, IFNAMSIZ) || put_user(u, &ifr->ifr_flags)) ret = -EFAULT; tap_put_tap_dev(tap); rtnl_unlock(); return ret; case TUNSETQUEUE: if (get_user(u, &ifr->ifr_flags)) return -EFAULT; rtnl_lock(); ret = tap_ioctl_set_queue(file, u); rtnl_unlock(); return ret; case TUNGETFEATURES: if (put_user(IFF_TAP | IFF_NO_PI | TAP_IFFEATURES, up)) return -EFAULT; return 0; case TUNSETSNDBUF: if (get_user(s, sp)) return -EFAULT; if (s <= 0) return -EINVAL; q->sk.sk_sndbuf = s; return 0; case TUNSETOFFLOAD: /* let the user check for future flags */ if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_TSO_ECN | TUN_F_UFO | TUN_F_USO4 | TUN_F_USO6)) return -EINVAL; rtnl_lock(); ret = set_offload(q, arg); rtnl_unlock(); return ret; case SIOCGIFHWADDR: rtnl_lock(); tap = tap_get_tap_dev(q); if (!tap) { rtnl_unlock(); return -ENOLINK; } ret = 0; netif_get_mac_address((struct sockaddr *)&ss, dev_net(tap->dev), tap->dev->name); if (copy_to_user(&ifr->ifr_name, tap->dev->name, IFNAMSIZ) || copy_to_user(&ifr->ifr_hwaddr, &ss, sizeof(ifr->ifr_hwaddr))) ret = -EFAULT; tap_put_tap_dev(tap); rtnl_unlock(); return ret; case SIOCSIFHWADDR: if (copy_from_user(&ss, &ifr->ifr_hwaddr, sizeof(ifr->ifr_hwaddr))) return -EFAULT; rtnl_lock(); tap = tap_get_tap_dev(q); if (!tap) { rtnl_unlock(); return -ENOLINK; } if (tap->dev->addr_len > sizeof(ifr->ifr_hwaddr)) ret = -EINVAL; else ret = dev_set_mac_address_user(tap->dev, &ss, NULL); tap_put_tap_dev(tap); rtnl_unlock(); return ret; default: return tun_vnet_ioctl(&q->vnet_hdr_sz, &q->flags, cmd, sp); } } static const struct file_operations tap_fops = { .owner = THIS_MODULE, .open = tap_open, .release = tap_release, .read_iter = tap_read_iter, .write_iter = tap_write_iter, .poll = tap_poll, .unlocked_ioctl = tap_ioctl, .compat_ioctl = compat_ptr_ioctl, }; static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp) { struct virtio_net_hdr *gso = xdp->data_hard_start; int buflen = xdp->frame_sz; int vnet_hdr_len = 0; struct tap_dev *tap; struct sk_buff *skb; int err, depth; if (unlikely(xdp->data_end - xdp->data < ETH_HLEN)) { err = -EINVAL; goto err; } if (q->flags & IFF_VNET_HDR) vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz); skb = build_skb(xdp->data_hard_start, buflen); if (!skb) { err = -ENOMEM; goto err; } skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); skb_set_network_header(skb, ETH_HLEN); skb_reset_mac_header(skb); skb->protocol = eth_hdr(skb)->h_proto; if (vnet_hdr_len) { err = tun_vnet_hdr_to_skb(q->flags, skb, gso); if (err) goto err_kfree; } /* Move network header to the right position for VLAN tagged packets */ if (eth_type_vlan(skb->protocol) && vlan_get_protocol_and_depth(skb, skb->protocol, &depth) != 0) skb_set_network_header(skb, depth); rcu_read_lock(); tap = rcu_dereference(q->tap); if (tap) { skb->dev = tap->dev; skb_probe_transport_header(skb); dev_queue_xmit(skb); } else { kfree_skb(skb); } rcu_read_unlock(); return 0; err_kfree: kfree_skb(skb); err: rcu_read_lock(); tap = rcu_dereference(q->tap); if (tap && tap->count_tx_dropped) tap->count_tx_dropped(tap); rcu_read_unlock(); return err; } static int tap_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) { struct tap_queue *q = container_of(sock, struct tap_queue, sock); struct tun_msg_ctl *ctl = m->msg_control; struct xdp_buff *xdp; int i; if (m->msg_controllen == sizeof(struct tun_msg_ctl) && ctl && ctl->type == TUN_MSG_PTR) { for (i = 0; i < ctl->num; i++) { xdp = &((struct xdp_buff *)ctl->ptr)[i]; tap_get_user_xdp(q, xdp); } return 0; } return tap_get_user(q, ctl ? ctl->ptr : NULL, &m->msg_iter, m->msg_flags & MSG_DONTWAIT); } static int tap_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len, int flags) { struct tap_queue *q = container_of(sock, struct tap_queue, sock); struct sk_buff *skb = m->msg_control; int ret; if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) { kfree_skb(skb); return -EINVAL; } ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT, skb); if (ret > total_len) { m->msg_flags |= MSG_TRUNC; ret = flags & MSG_TRUNC ? ret : total_len; } return ret; } static int tap_peek_len(struct socket *sock) { struct tap_queue *q = container_of(sock, struct tap_queue, sock); return PTR_RING_PEEK_CALL(&q->ring, __skb_array_len_with_tag); } /* Ops structure to mimic raw sockets with tun */ static const struct proto_ops tap_socket_ops = { .sendmsg = tap_sendmsg, .recvmsg = tap_recvmsg, .peek_len = tap_peek_len, }; /* Get an underlying socket object from tun file. Returns error unless file is * attached to a device. The returned object works like a packet socket, it * can be used for sock_sendmsg/sock_recvmsg. The caller is responsible for * holding a reference to the file for as long as the socket is in use. */ struct socket *tap_get_socket(struct file *file) { struct tap_queue *q; if (file->f_op != &tap_fops) return ERR_PTR(-EINVAL); q = file->private_data; if (!q) return ERR_PTR(-EBADFD); return &q->sock; } EXPORT_SYMBOL_GPL(tap_get_socket); struct ptr_ring *tap_get_ptr_ring(struct file *file) { struct tap_queue *q; if (file->f_op != &tap_fops) return ERR_PTR(-EINVAL); q = file->private_data; if (!q) return ERR_PTR(-EBADFD); return &q->ring; } EXPORT_SYMBOL_GPL(tap_get_ptr_ring); int tap_queue_resize(struct tap_dev *tap) { struct net_device *dev = tap->dev; struct tap_queue *q; struct ptr_ring **rings; int n = tap->numqueues; int ret, i = 0; rings = kmalloc_array(n, sizeof(*rings), GFP_KERNEL); if (!rings) return -ENOMEM; list_for_each_entry(q, &tap->queue_list, next) rings[i++] = &q->ring; ret = ptr_ring_resize_multiple_bh(rings, n, dev->tx_queue_len, GFP_KERNEL, __skb_array_destroy_skb); kfree(rings); return ret; } EXPORT_SYMBOL_GPL(tap_queue_resize); static int tap_list_add(dev_t major, const char *device_name) { struct major_info *tap_major; tap_major = kzalloc(sizeof(*tap_major), GFP_ATOMIC); if (!tap_major) return -ENOMEM; tap_major->major = MAJOR(major); idr_init(&tap_major->minor_idr); spin_lock_init(&tap_major->minor_lock); tap_major->device_name = device_name; list_add_tail_rcu(&tap_major->next, &major_list); return 0; } int tap_create_cdev(struct cdev *tap_cdev, dev_t *tap_major, const char *device_name, struct module *module) { int err; err = alloc_chrdev_region(tap_major, 0, TAP_NUM_DEVS, device_name); if (err) goto out1; cdev_init(tap_cdev, &tap_fops); tap_cdev->owner = module; err = cdev_add(tap_cdev, *tap_major, TAP_NUM_DEVS); if (err) goto out2; err = tap_list_add(*tap_major, device_name); if (err) goto out3; return 0; out3: cdev_del(tap_cdev); out2: unregister_chrdev_region(*tap_major, TAP_NUM_DEVS); out1: return err; } EXPORT_SYMBOL_GPL(tap_create_cdev); void tap_destroy_cdev(dev_t major, struct cdev *tap_cdev) { struct major_info *tap_major, *tmp; cdev_del(tap_cdev); unregister_chrdev_region(major, TAP_NUM_DEVS); list_for_each_entry_safe(tap_major, tmp, &major_list, next) { if (tap_major->major == MAJOR(major)) { idr_destroy(&tap_major->minor_idr); list_del_rcu(&tap_major->next); kfree_rcu(tap_major, rcu); } } } EXPORT_SYMBOL_GPL(tap_destroy_cdev); MODULE_DESCRIPTION("Common library for drivers implementing the TAP interface"); MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>"); MODULE_AUTHOR("Sainath Grandhi <sainath.grandhi@intel.com>"); MODULE_LICENSE("GPL"); MODULE_IMPORT_NS("NETDEV_INTERNAL"); |
| 763 763 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 | // SPDX-License-Identifier: GPL-2.0-only /* PIPAPO: PIle PAcket POlicies: AVX2 packet lookup routines * * Copyright (c) 2019-2020 Red Hat GmbH * * Author: Stefano Brivio <sbrivio@redhat.com> */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nf_tables.h> #include <net/netfilter/nf_tables_core.h> #include <uapi/linux/netfilter/nf_tables.h> #include <linux/bitmap.h> #include <linux/bitops.h> #include <linux/compiler.h> #include <asm/fpu/api.h> #include "nft_set_pipapo_avx2.h" #include "nft_set_pipapo.h" #define NFT_PIPAPO_LONGS_PER_M256 (XSAVE_YMM_SIZE / BITS_PER_LONG) /* Load from memory into YMM register with non-temporal hint ("stream load"), * that is, don't fetch lines from memory into the cache. This avoids pushing * precious packet data out of the cache hierarchy, and is appropriate when: * * - loading buckets from lookup tables, as they are not going to be used * again before packets are entirely classified * * - loading the result bitmap from the previous field, as it's never used * again */ #define NFT_PIPAPO_AVX2_LOAD(reg, loc) \ asm volatile("vmovntdqa %0, %%ymm" #reg : : "m" (loc)) /* Stream a single lookup table bucket into YMM register given lookup table, * group index, value of packet bits, bucket size. */ #define NFT_PIPAPO_AVX2_BUCKET_LOAD4(reg, lt, group, v, bsize) \ NFT_PIPAPO_AVX2_LOAD(reg, \ lt[((group) * NFT_PIPAPO_BUCKETS(4) + \ (v)) * (bsize)]) #define NFT_PIPAPO_AVX2_BUCKET_LOAD8(reg, lt, group, v, bsize) \ NFT_PIPAPO_AVX2_LOAD(reg, \ lt[((group) * NFT_PIPAPO_BUCKETS(8) + \ (v)) * (bsize)]) /* Bitwise AND: the staple operation of this algorithm */ #define NFT_PIPAPO_AVX2_AND(dst, a, b) \ asm volatile("vpand %ymm" #a ", %ymm" #b ", %ymm" #dst) /* Jump to label if @reg is zero */ #define NFT_PIPAPO_AVX2_NOMATCH_GOTO(reg, label) \ asm goto("vptest %%ymm" #reg ", %%ymm" #reg ";" \ "je %l[" #label "]" : : : : label) /* Store 256 bits from YMM register into memory. Contrary to bucket load * operation, we don't bypass the cache here, as stored matching results * are always used shortly after. */ #define NFT_PIPAPO_AVX2_STORE(loc, reg) \ asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc)) /* Zero out a complete YMM register, @reg */ #define NFT_PIPAPO_AVX2_ZERO(reg) \ asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg) /** * nft_pipapo_avx2_prepare() - Prepare before main algorithm body * * This zeroes out ymm15, which is later used whenever we need to clear a * memory location, by storing its content into memory. */ static void nft_pipapo_avx2_prepare(void) { NFT_PIPAPO_AVX2_ZERO(15); } /** * nft_pipapo_avx2_fill() - Fill a bitmap region with ones * @data: Base memory area * @start: First bit to set * @len: Count of bits to fill * * This is nothing else than a version of bitmap_set(), as used e.g. by * pipapo_refill(), tailored for the microarchitectures using it and better * suited for the specific usage: it's very likely that we'll set a small number * of bits, not crossing a word boundary, and correct branch prediction is * critical here. * * This function doesn't actually use any AVX2 instruction. */ static void nft_pipapo_avx2_fill(unsigned long *data, int start, int len) { int offset = start % BITS_PER_LONG; unsigned long mask; data += start / BITS_PER_LONG; if (likely(len == 1)) { *data |= BIT(offset); return; } if (likely(len < BITS_PER_LONG || offset)) { if (likely(len + offset <= BITS_PER_LONG)) { *data |= GENMASK(len - 1 + offset, offset); return; } *data |= ~0UL << offset; len -= BITS_PER_LONG - offset; data++; if (len <= BITS_PER_LONG) { mask = ~0UL >> (BITS_PER_LONG - len); *data |= mask; return; } } memset(data, 0xff, len / BITS_PER_BYTE); data += len / BITS_PER_LONG; len %= BITS_PER_LONG; if (len) *data |= ~0UL >> (BITS_PER_LONG - len); } /** * nft_pipapo_avx2_refill() - Scan bitmap, select mapping table item, set bits * @offset: Start from given bitmap (equivalent to bucket) offset, in longs * @map: Bitmap to be scanned for set bits * @dst: Destination bitmap * @mt: Mapping table containing bit set specifiers * @last: Return index of first set bit, if this is the last field * * This is an alternative implementation of pipapo_refill() suitable for usage * with AVX2 lookup routines: we know there are four words to be scanned, at * a given offset inside the map, for each matching iteration. * * This function doesn't actually use any AVX2 instruction. * * Return: first set bit index if @last, index of first filled word otherwise. */ static int nft_pipapo_avx2_refill(int offset, unsigned long *map, unsigned long *dst, union nft_pipapo_map_bucket *mt, bool last) { int ret = -1; #define NFT_PIPAPO_AVX2_REFILL_ONE_WORD(x) \ do { \ while (map[(x)]) { \ int r = __builtin_ctzl(map[(x)]); \ int i = (offset + (x)) * BITS_PER_LONG + r; \ \ if (last) \ return i; \ \ nft_pipapo_avx2_fill(dst, mt[i].to, mt[i].n); \ \ if (ret == -1) \ ret = mt[i].to; \ \ map[(x)] &= ~(1UL << r); \ } \ } while (0) NFT_PIPAPO_AVX2_REFILL_ONE_WORD(0); NFT_PIPAPO_AVX2_REFILL_ONE_WORD(1); NFT_PIPAPO_AVX2_REFILL_ONE_WORD(2); NFT_PIPAPO_AVX2_REFILL_ONE_WORD(3); #undef NFT_PIPAPO_AVX2_REFILL_ONE_WORD return ret; } /** * nft_pipapo_avx2_lookup_4b_2() - AVX2-based lookup for 2 four-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * Load buckets from lookup table corresponding to the values of each 4-bit * group of packet bytes, and perform a bitwise intersection between them. If * this is the first field in the set, simply AND the buckets together * (equivalent to using an all-ones starting bitmap), use the provided starting * bitmap otherwise. Then call nft_pipapo_avx2_refill() to generate the next * working bitmap, @fill. * * This is used for 8-bit fields (i.e. protocol numbers). * * Out-of-order (and superscalar) execution is vital here, so it's critical to * avoid false data dependencies. CPU and compiler could (mostly) take care of * this on their own, but the operation ordering is explicitly given here with * a likely execution order in mind, to highlight possible stalls. That's why * a number of logically distinct operations (i.e. loading buckets, intersecting * buckets) are interleaved. * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_4b_2(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; u8 pg[2] = { pkt[0] >> 4, pkt[0] & 0xf }; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (first) { NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_AND(4, 0, 1); } else { NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_LOAD(2, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_NOMATCH_GOTO(2, nothing); NFT_PIPAPO_AVX2_AND(3, 0, 1); NFT_PIPAPO_AVX2_AND(4, 2, 3); } NFT_PIPAPO_AVX2_NOMATCH_GOTO(4, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 4); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_4b_4() - AVX2-based lookup for 4 four-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 16-bit fields (i.e. ports). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_4b_4(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; u8 pg[4] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf }; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (first) { NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(2, lt, 2, pg[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 3, pg[3], bsize); NFT_PIPAPO_AVX2_AND(4, 0, 1); NFT_PIPAPO_AVX2_AND(5, 2, 3); NFT_PIPAPO_AVX2_AND(7, 4, 5); } else { NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD4(2, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 2, pg[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(4, lt, 3, pg[3], bsize); NFT_PIPAPO_AVX2_AND(5, 0, 1); NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing); NFT_PIPAPO_AVX2_AND(6, 2, 3); NFT_PIPAPO_AVX2_AND(7, 4, 5); /* Stall */ NFT_PIPAPO_AVX2_AND(7, 6, 7); } /* Stall */ NFT_PIPAPO_AVX2_NOMATCH_GOTO(7, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 7); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_4b_8() - AVX2-based lookup for 8 four-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 32-bit fields (i.e. IPv4 addresses). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_4b_8(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { u8 pg[8] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf, pkt[2] >> 4, pkt[2] & 0xf, pkt[3] >> 4, pkt[3] & 0xf, }; int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (first) { NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(2, lt, 2, pg[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 3, pg[3], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(4, lt, 4, pg[4], bsize); NFT_PIPAPO_AVX2_AND(5, 0, 1); NFT_PIPAPO_AVX2_BUCKET_LOAD4(6, lt, 5, pg[5], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(7, lt, 6, pg[6], bsize); NFT_PIPAPO_AVX2_AND(8, 2, 3); NFT_PIPAPO_AVX2_AND(9, 4, 5); NFT_PIPAPO_AVX2_BUCKET_LOAD4(10, lt, 7, pg[7], bsize); NFT_PIPAPO_AVX2_AND(11, 6, 7); NFT_PIPAPO_AVX2_AND(12, 8, 9); NFT_PIPAPO_AVX2_AND(13, 10, 11); /* Stall */ NFT_PIPAPO_AVX2_AND(1, 12, 13); } else { NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD4(2, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 2, pg[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(4, lt, 3, pg[3], bsize); NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing); NFT_PIPAPO_AVX2_AND(5, 0, 1); NFT_PIPAPO_AVX2_BUCKET_LOAD4(6, lt, 4, pg[4], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(7, lt, 5, pg[5], bsize); NFT_PIPAPO_AVX2_AND(8, 2, 3); NFT_PIPAPO_AVX2_BUCKET_LOAD4(9, lt, 6, pg[6], bsize); NFT_PIPAPO_AVX2_AND(10, 4, 5); NFT_PIPAPO_AVX2_BUCKET_LOAD4(11, lt, 7, pg[7], bsize); NFT_PIPAPO_AVX2_AND(12, 6, 7); NFT_PIPAPO_AVX2_AND(13, 8, 9); NFT_PIPAPO_AVX2_AND(14, 10, 11); /* Stall */ NFT_PIPAPO_AVX2_AND(1, 12, 13); NFT_PIPAPO_AVX2_AND(1, 1, 14); } NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 1); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_4b_12() - AVX2-based lookup for 12 four-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 48-bit fields (i.e. MAC addresses/EUI-48). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_4b_12(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { u8 pg[12] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf, pkt[2] >> 4, pkt[2] & 0xf, pkt[3] >> 4, pkt[3] & 0xf, pkt[4] >> 4, pkt[4] & 0xf, pkt[5] >> 4, pkt[5] & 0xf, }; int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (!first) NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(2, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 2, pg[2], bsize); if (!first) { NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing); NFT_PIPAPO_AVX2_AND(1, 1, 0); } NFT_PIPAPO_AVX2_BUCKET_LOAD4(4, lt, 3, pg[3], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(5, lt, 4, pg[4], bsize); NFT_PIPAPO_AVX2_AND(6, 2, 3); NFT_PIPAPO_AVX2_BUCKET_LOAD4(7, lt, 5, pg[5], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(8, lt, 6, pg[6], bsize); NFT_PIPAPO_AVX2_AND(9, 1, 4); NFT_PIPAPO_AVX2_BUCKET_LOAD4(10, lt, 7, pg[7], bsize); NFT_PIPAPO_AVX2_AND(11, 5, 6); NFT_PIPAPO_AVX2_BUCKET_LOAD4(12, lt, 8, pg[8], bsize); NFT_PIPAPO_AVX2_AND(13, 7, 8); NFT_PIPAPO_AVX2_BUCKET_LOAD4(14, lt, 9, pg[9], bsize); NFT_PIPAPO_AVX2_AND(0, 9, 10); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 10, pg[10], bsize); NFT_PIPAPO_AVX2_AND(2, 11, 12); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 11, pg[11], bsize); NFT_PIPAPO_AVX2_AND(4, 13, 14); NFT_PIPAPO_AVX2_AND(5, 0, 1); NFT_PIPAPO_AVX2_AND(6, 2, 3); /* Stalls */ NFT_PIPAPO_AVX2_AND(7, 4, 5); NFT_PIPAPO_AVX2_AND(8, 6, 7); NFT_PIPAPO_AVX2_NOMATCH_GOTO(8, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 8); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_4b_32() - AVX2-based lookup for 32 four-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 128-bit fields (i.e. IPv6 addresses). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_4b_32(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { u8 pg[32] = { pkt[0] >> 4, pkt[0] & 0xf, pkt[1] >> 4, pkt[1] & 0xf, pkt[2] >> 4, pkt[2] & 0xf, pkt[3] >> 4, pkt[3] & 0xf, pkt[4] >> 4, pkt[4] & 0xf, pkt[5] >> 4, pkt[5] & 0xf, pkt[6] >> 4, pkt[6] & 0xf, pkt[7] >> 4, pkt[7] & 0xf, pkt[8] >> 4, pkt[8] & 0xf, pkt[9] >> 4, pkt[9] & 0xf, pkt[10] >> 4, pkt[10] & 0xf, pkt[11] >> 4, pkt[11] & 0xf, pkt[12] >> 4, pkt[12] & 0xf, pkt[13] >> 4, pkt[13] & 0xf, pkt[14] >> 4, pkt[14] & 0xf, pkt[15] >> 4, pkt[15] & 0xf, }; int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (!first) NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 0, pg[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(2, lt, 1, pg[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 2, pg[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(4, lt, 3, pg[3], bsize); if (!first) { NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing); NFT_PIPAPO_AVX2_AND(1, 1, 0); } NFT_PIPAPO_AVX2_AND(5, 2, 3); NFT_PIPAPO_AVX2_BUCKET_LOAD4(6, lt, 4, pg[4], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(7, lt, 5, pg[5], bsize); NFT_PIPAPO_AVX2_AND(8, 1, 4); NFT_PIPAPO_AVX2_BUCKET_LOAD4(9, lt, 6, pg[6], bsize); NFT_PIPAPO_AVX2_AND(10, 5, 6); NFT_PIPAPO_AVX2_BUCKET_LOAD4(11, lt, 7, pg[7], bsize); NFT_PIPAPO_AVX2_AND(12, 7, 8); NFT_PIPAPO_AVX2_BUCKET_LOAD4(13, lt, 8, pg[8], bsize); NFT_PIPAPO_AVX2_AND(14, 9, 10); NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 9, pg[9], bsize); NFT_PIPAPO_AVX2_AND(1, 11, 12); NFT_PIPAPO_AVX2_BUCKET_LOAD4(2, lt, 10, pg[10], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 11, pg[11], bsize); NFT_PIPAPO_AVX2_AND(4, 13, 14); NFT_PIPAPO_AVX2_BUCKET_LOAD4(5, lt, 12, pg[12], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(6, lt, 13, pg[13], bsize); NFT_PIPAPO_AVX2_AND(7, 0, 1); NFT_PIPAPO_AVX2_BUCKET_LOAD4(8, lt, 14, pg[14], bsize); NFT_PIPAPO_AVX2_AND(9, 2, 3); NFT_PIPAPO_AVX2_BUCKET_LOAD4(10, lt, 15, pg[15], bsize); NFT_PIPAPO_AVX2_AND(11, 4, 5); NFT_PIPAPO_AVX2_BUCKET_LOAD4(12, lt, 16, pg[16], bsize); NFT_PIPAPO_AVX2_AND(13, 6, 7); NFT_PIPAPO_AVX2_BUCKET_LOAD4(14, lt, 17, pg[17], bsize); NFT_PIPAPO_AVX2_AND(0, 8, 9); NFT_PIPAPO_AVX2_BUCKET_LOAD4(1, lt, 18, pg[18], bsize); NFT_PIPAPO_AVX2_AND(2, 10, 11); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 19, pg[19], bsize); NFT_PIPAPO_AVX2_AND(4, 12, 13); NFT_PIPAPO_AVX2_BUCKET_LOAD4(5, lt, 20, pg[20], bsize); NFT_PIPAPO_AVX2_AND(6, 14, 0); NFT_PIPAPO_AVX2_AND(7, 1, 2); NFT_PIPAPO_AVX2_BUCKET_LOAD4(8, lt, 21, pg[21], bsize); NFT_PIPAPO_AVX2_AND(9, 3, 4); NFT_PIPAPO_AVX2_BUCKET_LOAD4(10, lt, 22, pg[22], bsize); NFT_PIPAPO_AVX2_AND(11, 5, 6); NFT_PIPAPO_AVX2_BUCKET_LOAD4(12, lt, 23, pg[23], bsize); NFT_PIPAPO_AVX2_AND(13, 7, 8); NFT_PIPAPO_AVX2_BUCKET_LOAD4(14, lt, 24, pg[24], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(0, lt, 25, pg[25], bsize); NFT_PIPAPO_AVX2_AND(1, 9, 10); NFT_PIPAPO_AVX2_AND(2, 11, 12); NFT_PIPAPO_AVX2_BUCKET_LOAD4(3, lt, 26, pg[26], bsize); NFT_PIPAPO_AVX2_AND(4, 13, 14); NFT_PIPAPO_AVX2_BUCKET_LOAD4(5, lt, 27, pg[27], bsize); NFT_PIPAPO_AVX2_AND(6, 0, 1); NFT_PIPAPO_AVX2_BUCKET_LOAD4(7, lt, 28, pg[28], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD4(8, lt, 29, pg[29], bsize); NFT_PIPAPO_AVX2_AND(9, 2, 3); NFT_PIPAPO_AVX2_BUCKET_LOAD4(10, lt, 30, pg[30], bsize); NFT_PIPAPO_AVX2_AND(11, 4, 5); NFT_PIPAPO_AVX2_BUCKET_LOAD4(12, lt, 31, pg[31], bsize); NFT_PIPAPO_AVX2_AND(0, 6, 7); NFT_PIPAPO_AVX2_AND(1, 8, 9); NFT_PIPAPO_AVX2_AND(2, 10, 11); NFT_PIPAPO_AVX2_AND(3, 12, 0); /* Stalls */ NFT_PIPAPO_AVX2_AND(4, 1, 2); NFT_PIPAPO_AVX2_AND(5, 3, 4); NFT_PIPAPO_AVX2_NOMATCH_GOTO(5, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 5); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_8b_1() - AVX2-based lookup for one eight-bit group * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 8-bit fields (i.e. protocol numbers). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_8b_1(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (first) { NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 0, pkt[0], bsize); } else { NFT_PIPAPO_AVX2_BUCKET_LOAD8(0, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]); NFT_PIPAPO_AVX2_AND(2, 0, 1); NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing); } NFT_PIPAPO_AVX2_NOMATCH_GOTO(2, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 2); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_8b_2() - AVX2-based lookup for 2 eight-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 16-bit fields (i.e. ports). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_8b_2(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (first) { NFT_PIPAPO_AVX2_BUCKET_LOAD8(0, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 1, pkt[1], bsize); NFT_PIPAPO_AVX2_AND(4, 0, 1); } else { NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 1, pkt[1], bsize); /* Stall */ NFT_PIPAPO_AVX2_AND(3, 0, 1); NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing); NFT_PIPAPO_AVX2_AND(4, 3, 2); } /* Stall */ NFT_PIPAPO_AVX2_NOMATCH_GOTO(4, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 4); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_8b_4() - AVX2-based lookup for 4 eight-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 32-bit fields (i.e. IPv4 addresses). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_8b_4(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (first) { NFT_PIPAPO_AVX2_BUCKET_LOAD8(0, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 1, pkt[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 2, pkt[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(3, lt, 3, pkt[3], bsize); /* Stall */ NFT_PIPAPO_AVX2_AND(4, 0, 1); NFT_PIPAPO_AVX2_AND(5, 2, 3); NFT_PIPAPO_AVX2_AND(0, 4, 5); } else { NFT_PIPAPO_AVX2_BUCKET_LOAD8(0, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 1, pkt[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(3, lt, 2, pkt[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(4, lt, 3, pkt[3], bsize); NFT_PIPAPO_AVX2_AND(5, 0, 1); NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing); NFT_PIPAPO_AVX2_AND(6, 2, 3); /* Stall */ NFT_PIPAPO_AVX2_AND(7, 4, 5); NFT_PIPAPO_AVX2_AND(0, 6, 7); } NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 0); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_8b_6() - AVX2-based lookup for 6 eight-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 48-bit fields (i.e. MAC addresses/EUI-48). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_8b_6(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (first) { NFT_PIPAPO_AVX2_BUCKET_LOAD8(0, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 1, pkt[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 2, pkt[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(3, lt, 3, pkt[3], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(4, lt, 4, pkt[4], bsize); NFT_PIPAPO_AVX2_AND(5, 0, 1); NFT_PIPAPO_AVX2_BUCKET_LOAD8(6, lt, 5, pkt[5], bsize); NFT_PIPAPO_AVX2_AND(7, 2, 3); /* Stall */ NFT_PIPAPO_AVX2_AND(0, 4, 5); NFT_PIPAPO_AVX2_AND(1, 6, 7); NFT_PIPAPO_AVX2_AND(4, 0, 1); } else { NFT_PIPAPO_AVX2_BUCKET_LOAD8(0, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_LOAD(1, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 1, pkt[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(3, lt, 2, pkt[2], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(4, lt, 3, pkt[3], bsize); NFT_PIPAPO_AVX2_AND(5, 0, 1); NFT_PIPAPO_AVX2_NOMATCH_GOTO(1, nothing); NFT_PIPAPO_AVX2_AND(6, 2, 3); NFT_PIPAPO_AVX2_BUCKET_LOAD8(7, lt, 4, pkt[4], bsize); NFT_PIPAPO_AVX2_AND(0, 4, 5); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 5, pkt[5], bsize); NFT_PIPAPO_AVX2_AND(2, 6, 7); /* Stall */ NFT_PIPAPO_AVX2_AND(3, 0, 1); NFT_PIPAPO_AVX2_AND(4, 2, 3); } NFT_PIPAPO_AVX2_NOMATCH_GOTO(4, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 4); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_8b_16() - AVX2-based lookup for 16 eight-bit groups * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * See nft_pipapo_avx2_lookup_4b_2(). * * This is used for 128-bit fields (i.e. IPv6 addresses). * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_8b_16(unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { int i, ret = -1, m256_size = f->bsize / NFT_PIPAPO_LONGS_PER_M256, b; unsigned long *lt = f->lt, bsize = f->bsize; lt += offset * NFT_PIPAPO_LONGS_PER_M256; for (i = offset; i < m256_size; i++, lt += NFT_PIPAPO_LONGS_PER_M256) { int i_ul = i * NFT_PIPAPO_LONGS_PER_M256; if (!first) NFT_PIPAPO_AVX2_LOAD(0, map[i_ul]); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 0, pkt[0], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 1, pkt[1], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(3, lt, 2, pkt[2], bsize); if (!first) { NFT_PIPAPO_AVX2_NOMATCH_GOTO(0, nothing); NFT_PIPAPO_AVX2_AND(1, 1, 0); } NFT_PIPAPO_AVX2_BUCKET_LOAD8(4, lt, 3, pkt[3], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(5, lt, 4, pkt[4], bsize); NFT_PIPAPO_AVX2_AND(6, 1, 2); NFT_PIPAPO_AVX2_BUCKET_LOAD8(7, lt, 5, pkt[5], bsize); NFT_PIPAPO_AVX2_AND(0, 3, 4); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 6, pkt[6], bsize); NFT_PIPAPO_AVX2_BUCKET_LOAD8(2, lt, 7, pkt[7], bsize); NFT_PIPAPO_AVX2_AND(3, 5, 6); NFT_PIPAPO_AVX2_AND(4, 0, 1); NFT_PIPAPO_AVX2_BUCKET_LOAD8(5, lt, 8, pkt[8], bsize); NFT_PIPAPO_AVX2_AND(6, 2, 3); NFT_PIPAPO_AVX2_AND(3, 4, 7); NFT_PIPAPO_AVX2_BUCKET_LOAD8(7, lt, 9, pkt[9], bsize); NFT_PIPAPO_AVX2_AND(0, 3, 5); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 10, pkt[10], bsize); NFT_PIPAPO_AVX2_AND(2, 6, 7); NFT_PIPAPO_AVX2_BUCKET_LOAD8(3, lt, 11, pkt[11], bsize); NFT_PIPAPO_AVX2_AND(4, 0, 1); NFT_PIPAPO_AVX2_BUCKET_LOAD8(5, lt, 12, pkt[12], bsize); NFT_PIPAPO_AVX2_AND(6, 2, 3); NFT_PIPAPO_AVX2_BUCKET_LOAD8(7, lt, 13, pkt[13], bsize); NFT_PIPAPO_AVX2_AND(0, 4, 5); NFT_PIPAPO_AVX2_BUCKET_LOAD8(1, lt, 14, pkt[14], bsize); NFT_PIPAPO_AVX2_AND(2, 6, 7); NFT_PIPAPO_AVX2_BUCKET_LOAD8(3, lt, 15, pkt[15], bsize); NFT_PIPAPO_AVX2_AND(4, 0, 1); /* Stall */ NFT_PIPAPO_AVX2_AND(5, 2, 3); NFT_PIPAPO_AVX2_AND(6, 4, 5); NFT_PIPAPO_AVX2_NOMATCH_GOTO(6, nomatch); NFT_PIPAPO_AVX2_STORE(map[i_ul], 6); b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last); if (last) return b; if (unlikely(ret == -1)) ret = b / XSAVE_YMM_SIZE; continue; nomatch: NFT_PIPAPO_AVX2_STORE(map[i_ul], 15); nothing: ; } return ret; } /** * nft_pipapo_avx2_lookup_slow() - Fallback function for uncommon field sizes * @mdata: Matching data, including mapping table * @map: Previous match result, used as initial bitmap * @fill: Destination bitmap to be filled with current match result * @f: Field, containing lookup and mapping tables * @offset: Ignore buckets before the given index, no bits are filled there * @pkt: Packet data, pointer to input nftables register * @first: If this is the first field, don't source previous result * @last: Last field: stop at the first match and return bit index * * This function should never be called, but is provided for the case the field * size doesn't match any of the known data types. Matching rate is * substantially lower than AVX2 routines. * * Return: -1 on no match, rule index of match if @last, otherwise first long * word index to be checked next (i.e. first filled word). */ static int nft_pipapo_avx2_lookup_slow(const struct nft_pipapo_match *mdata, unsigned long *map, unsigned long *fill, const struct nft_pipapo_field *f, int offset, const u8 *pkt, bool first, bool last) { unsigned long bsize = f->bsize; int i, ret = -1, b; if (first) pipapo_resmap_init(mdata, map); for (i = offset; i < bsize; i++) { if (f->bb == 8) pipapo_and_field_buckets_8bit(f, map, pkt); else pipapo_and_field_buckets_4bit(f, map, pkt); NFT_PIPAPO_GROUP_BITS_ARE_8_OR_4; b = pipapo_refill(map, bsize, f->rules, fill, f->mt, last); if (last) return b; if (ret == -1) ret = b / XSAVE_YMM_SIZE; } return ret; } /** * nft_pipapo_avx2_estimate() - Set size, space and lookup complexity * @desc: Set description, element count and field description used * @features: Flags: NFT_SET_INTERVAL needs to be there * @est: Storage for estimation data * * Return: true if set is compatible and AVX2 available, false otherwise. */ bool nft_pipapo_avx2_estimate(const struct nft_set_desc *desc, u32 features, struct nft_set_estimate *est) { if (!(features & NFT_SET_INTERVAL) || desc->field_count < NFT_PIPAPO_MIN_FIELDS) return false; if (!boot_cpu_has(X86_FEATURE_AVX2) || !boot_cpu_has(X86_FEATURE_AVX)) return false; est->size = pipapo_estimate_size(desc); if (!est->size) return false; est->lookup = NFT_SET_CLASS_O_LOG_N; est->space = NFT_SET_CLASS_O_N; return true; } /** * pipapo_resmap_init_avx2() - Initialise result map before first use * @m: Matching data, including mapping table * @res_map: Result map * * Like pipapo_resmap_init() but do not set start map bits covered by the first field. */ static inline void pipapo_resmap_init_avx2(const struct nft_pipapo_match *m, unsigned long *res_map) { const struct nft_pipapo_field *f = m->f; int i; /* Starting map doesn't need to be set to all-ones for this implementation, * but we do need to zero the remaining bits, if any. */ for (i = f->bsize; i < m->bsize_max; i++) res_map[i] = 0ul; } /** * nft_pipapo_avx2_lookup() - Lookup function for AVX2 implementation * @net: Network namespace * @set: nftables API set representation * @key: nftables API element representation containing key data * * For more details, see DOC: Theory of Operation in nft_set_pipapo.c. * * This implementation exploits the repetitive characteristic of the algorithm * to provide a fast, vectorised version using the AVX2 SIMD instruction set. * * Return: true on match, false otherwise. */ const struct nft_set_ext * nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set, const u32 *key) { struct nft_pipapo *priv = nft_set_priv(set); const struct nft_set_ext *ext = NULL; struct nft_pipapo_scratch *scratch; const struct nft_pipapo_match *m; const struct nft_pipapo_field *f; const u8 *rp = (const u8 *)key; unsigned long *res, *fill; bool map_index; int i; local_bh_disable(); if (unlikely(!irq_fpu_usable())) { ext = nft_pipapo_lookup(net, set, key); local_bh_enable(); return ext; } m = rcu_dereference(priv->match); /* This also protects access to all data related to scratch maps. * * Note that we don't need a valid MXCSR state for any of the * operations we use here, so pass 0 as mask and spare a LDMXCSR * instruction. */ kernel_fpu_begin_mask(0); scratch = *raw_cpu_ptr(m->scratch); if (unlikely(!scratch)) { kernel_fpu_end(); local_bh_enable(); return NULL; } map_index = scratch->map_index; res = scratch->map + (map_index ? m->bsize_max : 0); fill = scratch->map + (map_index ? 0 : m->bsize_max); pipapo_resmap_init_avx2(m, res); nft_pipapo_avx2_prepare(); next_match: nft_pipapo_for_each_field(f, i, m) { bool last = i == m->field_count - 1, first = !i; int ret = 0; #define NFT_SET_PIPAPO_AVX2_LOOKUP(b, n) \ (ret = nft_pipapo_avx2_lookup_##b##b_##n(res, fill, f, \ ret, rp, \ first, last)) if (likely(f->bb == 8)) { if (f->groups == 1) { NFT_SET_PIPAPO_AVX2_LOOKUP(8, 1); } else if (f->groups == 2) { NFT_SET_PIPAPO_AVX2_LOOKUP(8, 2); } else if (f->groups == 4) { NFT_SET_PIPAPO_AVX2_LOOKUP(8, 4); } else if (f->groups == 6) { NFT_SET_PIPAPO_AVX2_LOOKUP(8, 6); } else if (f->groups == 16) { NFT_SET_PIPAPO_AVX2_LOOKUP(8, 16); } else { ret = nft_pipapo_avx2_lookup_slow(m, res, fill, f, ret, rp, first, last); } } else { if (f->groups == 2) { NFT_SET_PIPAPO_AVX2_LOOKUP(4, 2); } else if (f->groups == 4) { NFT_SET_PIPAPO_AVX2_LOOKUP(4, 4); } else if (f->groups == 8) { NFT_SET_PIPAPO_AVX2_LOOKUP(4, 8); } else if (f->groups == 12) { NFT_SET_PIPAPO_AVX2_LOOKUP(4, 12); } else if (f->groups == 32) { NFT_SET_PIPAPO_AVX2_LOOKUP(4, 32); } else { ret = nft_pipapo_avx2_lookup_slow(m, res, fill, f, ret, rp, first, last); } } NFT_PIPAPO_GROUP_BITS_ARE_8_OR_4; #undef NFT_SET_PIPAPO_AVX2_LOOKUP if (ret < 0) goto out; if (last) { const struct nft_set_ext *e = &f->mt[ret].e->ext; if (unlikely(nft_set_elem_expired(e))) goto next_match; ext = e; goto out; } swap(res, fill); rp += NFT_PIPAPO_GROUPS_PADDED_SIZE(f); } out: if (i % 2) scratch->map_index = !map_index; kernel_fpu_end(); local_bh_enable(); return ext; } |
| 1 82 2774 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __LINUX_NETLINK_H #define __LINUX_NETLINK_H #include <linux/capability.h> #include <linux/skbuff.h> #include <linux/export.h> #include <net/scm.h> #include <uapi/linux/netlink.h> struct net; void do_trace_netlink_extack(const char *msg); static inline struct nlmsghdr *nlmsg_hdr(const struct sk_buff *skb) { return (struct nlmsghdr *)skb->data; } enum netlink_skb_flags { NETLINK_SKB_DST = 0x8, /* Dst set in sendto or sendmsg */ }; struct netlink_skb_parms { struct scm_creds creds; /* Skb credentials */ __u32 portid; __u32 dst_group; __u32 flags; struct sock *sk; bool nsid_is_set; int nsid; }; #define NETLINK_CB(skb) (*(struct netlink_skb_parms*)&((skb)->cb)) #define NETLINK_CREDS(skb) (&NETLINK_CB((skb)).creds) #define NETLINK_CTX_SIZE 48 void netlink_table_grab(void); void netlink_table_ungrab(void); #define NL_CFG_F_NONROOT_RECV (1 << 0) #define NL_CFG_F_NONROOT_SEND (1 << 1) /* optional Netlink kernel configuration parameters */ struct netlink_kernel_cfg { unsigned int groups; unsigned int flags; void (*input)(struct sk_buff *skb); int (*bind)(struct net *net, int group); void (*unbind)(struct net *net, int group); void (*release) (struct sock *sk, unsigned long *groups); }; struct sock *__netlink_kernel_create(struct net *net, int unit, struct module *module, struct netlink_kernel_cfg *cfg); static inline struct sock * netlink_kernel_create(struct net *net, int unit, struct netlink_kernel_cfg *cfg) { return __netlink_kernel_create(net, unit, THIS_MODULE, cfg); } /* this can be increased when necessary - don't expose to userland */ #define NETLINK_MAX_COOKIE_LEN 8 #define NETLINK_MAX_FMTMSG_LEN 80 /** * struct netlink_ext_ack - netlink extended ACK report struct * @_msg: message string to report - don't access directly, use * %NL_SET_ERR_MSG * @bad_attr: attribute with error * @policy: policy for a bad attribute * @miss_type: attribute type which was missing * @miss_nest: nest missing an attribute (%NULL if missing top level attr) * @cookie: cookie data to return to userspace (for success) * @cookie_len: actual cookie data length * @_msg_buf: output buffer for formatted message strings - don't access * directly, use %NL_SET_ERR_MSG_FMT */ struct netlink_ext_ack { const char *_msg; const struct nlattr *bad_attr; const struct nla_policy *policy; const struct nlattr *miss_nest; u16 miss_type; u8 cookie[NETLINK_MAX_COOKIE_LEN]; u8 cookie_len; char _msg_buf[NETLINK_MAX_FMTMSG_LEN]; }; /* Always use this macro, this allows later putting the * message into a separate section or such for things * like translation or listing all possible messages. * If string formatting is needed use NL_SET_ERR_MSG_FMT. */ #define NL_SET_ERR_MSG(extack, msg) do { \ static const char __msg[] = msg; \ struct netlink_ext_ack *__extack = (extack); \ \ do_trace_netlink_extack(__msg); \ \ if (__extack) \ __extack->_msg = __msg; \ } while (0) /* We splice fmt with %s at each end even in the snprintf so that both calls * can use the same string constant, avoiding its duplication in .ro */ #define NL_SET_ERR_MSG_FMT(extack, fmt, args...) do { \ struct netlink_ext_ack *__extack = (extack); \ \ if (!__extack) \ break; \ if (snprintf(__extack->_msg_buf, NETLINK_MAX_FMTMSG_LEN, \ "%s" fmt "%s", "", ##args, "") >= \ NETLINK_MAX_FMTMSG_LEN) \ net_warn_ratelimited("%s" fmt "%s", "truncated extack: ", \ ##args, "\n"); \ \ do_trace_netlink_extack(__extack->_msg_buf); \ \ __extack->_msg = __extack->_msg_buf; \ } while (0) #define NL_SET_ERR_MSG_MOD(extack, msg) \ NL_SET_ERR_MSG((extack), KBUILD_MODNAME ": " msg) #define NL_SET_ERR_MSG_FMT_MOD(extack, fmt, args...) \ NL_SET_ERR_MSG_FMT((extack), KBUILD_MODNAME ": " fmt, ##args) #define NL_SET_ERR_MSG_WEAK(extack, msg) do { \ if ((extack) && !(extack)->_msg) \ NL_SET_ERR_MSG((extack), msg); \ } while (0) #define NL_SET_ERR_MSG_WEAK_MOD(extack, msg) do { \ if ((extack) && !(extack)->_msg) \ NL_SET_ERR_MSG_MOD((extack), msg); \ } while (0) #define NL_SET_BAD_ATTR_POLICY(extack, attr, pol) do { \ if ((extack)) { \ (extack)->bad_attr = (attr); \ (extack)->policy = (pol); \ } \ } while (0) #define NL_SET_BAD_ATTR(extack, attr) NL_SET_BAD_ATTR_POLICY(extack, attr, NULL) #define NL_SET_ERR_MSG_ATTR_POL(extack, attr, pol, msg) do { \ static const char __msg[] = msg; \ struct netlink_ext_ack *__extack = (extack); \ \ do_trace_netlink_extack(__msg); \ \ if (__extack) { \ __extack->_msg = __msg; \ __extack->bad_attr = (attr); \ __extack->policy = (pol); \ } \ } while (0) #define NL_SET_ERR_MSG_ATTR_POL_FMT(extack, attr, pol, fmt, args...) do { \ struct netlink_ext_ack *__extack = (extack); \ \ if (!__extack) \ break; \ \ if (snprintf(__extack->_msg_buf, NETLINK_MAX_FMTMSG_LEN, \ "%s" fmt "%s", "", ##args, "") >= \ NETLINK_MAX_FMTMSG_LEN) \ net_warn_ratelimited("%s" fmt "%s", "truncated extack: ", \ ##args, "\n"); \ \ do_trace_netlink_extack(__extack->_msg_buf); \ \ __extack->_msg = __extack->_msg_buf; \ __extack->bad_attr = (attr); \ __extack->policy = (pol); \ } while (0) #define NL_SET_ERR_MSG_ATTR(extack, attr, msg) \ NL_SET_ERR_MSG_ATTR_POL(extack, attr, NULL, msg) #define NL_SET_ERR_MSG_ATTR_FMT(extack, attr, msg, args...) \ NL_SET_ERR_MSG_ATTR_POL_FMT(extack, attr, NULL, msg, ##args) #define NL_SET_ERR_ATTR_MISS(extack, nest, type) do { \ struct netlink_ext_ack *__extack = (extack); \ \ if (__extack) { \ __extack->miss_nest = (nest); \ __extack->miss_type = (type); \ } \ } while (0) #define NL_REQ_ATTR_CHECK(extack, nest, tb, type) ({ \ struct nlattr **__tb = (tb); \ u32 __attr = (type); \ int __retval; \ \ __retval = !__tb[__attr]; \ if (__retval) \ NL_SET_ERR_ATTR_MISS((extack), (nest), __attr); \ __retval; \ }) static inline void nl_set_extack_cookie_u64(struct netlink_ext_ack *extack, u64 cookie) { if (!extack) return; BUILD_BUG_ON(sizeof(extack->cookie) < sizeof(cookie)); memcpy(extack->cookie, &cookie, sizeof(cookie)); extack->cookie_len = sizeof(cookie); } void netlink_kernel_release(struct sock *sk); int __netlink_change_ngroups(struct sock *sk, unsigned int groups); int netlink_change_ngroups(struct sock *sk, unsigned int groups); void __netlink_clear_multicast_users(struct sock *sk, unsigned int group); void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err, const struct netlink_ext_ack *extack); int netlink_has_listeners(struct sock *sk, unsigned int group); bool netlink_strict_get_check(struct sk_buff *skb); int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 portid, int nonblock); int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 portid, __u32 group, gfp_t allocation); typedef int (*netlink_filter_fn)(struct sock *dsk, struct sk_buff *skb, void *data); int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, __u32 portid, __u32 group, gfp_t allocation, netlink_filter_fn filter, void *filter_data); int netlink_set_err(struct sock *ssk, __u32 portid, __u32 group, int code); int netlink_register_notifier(struct notifier_block *nb); int netlink_unregister_notifier(struct notifier_block *nb); /* finegrained unicast helpers: */ struct sock *netlink_getsockbyfd(int fd); int netlink_attachskb(struct sock *sk, struct sk_buff *skb, long *timeo, struct sock *ssk); void netlink_detachskb(struct sock *sk, struct sk_buff *skb); int netlink_sendskb(struct sock *sk, struct sk_buff *skb); static inline struct sk_buff * netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask) { struct sk_buff *nskb; nskb = skb_clone(skb, gfp_mask); if (!nskb) return NULL; /* This is a large skb, set destructor callback to release head */ if (is_vmalloc_addr(skb->head)) nskb->destructor = skb->destructor; return nskb; } /* * skb should fit one page. This choice is good for headerless malloc. * But we should limit to 8K so that userspace does not have to * use enormous buffer sizes on recvmsg() calls just to avoid * MSG_TRUNC when PAGE_SIZE is very large. */ #if PAGE_SIZE < 8192UL #define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(PAGE_SIZE) #else #define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(8192UL) #endif #define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN) struct netlink_callback { struct sk_buff *skb; const struct nlmsghdr *nlh; int (*dump)(struct sk_buff * skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); void *data; /* the module that dump function belong to */ struct module *module; struct netlink_ext_ack *extack; u16 family; u16 answer_flags; u32 min_dump_alloc; unsigned int prev_seq, seq; int flags; bool strict_check; union { u8 ctx[NETLINK_CTX_SIZE]; /* args is deprecated. Cast a struct over ctx instead * for proper type safety. */ long args[6]; }; }; #define NL_ASSERT_CTX_FITS(type_name) \ BUILD_BUG_ON(sizeof(type_name) > \ sizeof_field(struct netlink_callback, ctx)) struct netlink_notify { struct net *net; u32 portid; int protocol; }; struct nlmsghdr * __nlmsg_put(struct sk_buff *skb, u32 portid, u32 seq, int type, int len, int flags); struct netlink_dump_control { int (*start)(struct netlink_callback *); int (*dump)(struct sk_buff *skb, struct netlink_callback *); int (*done)(struct netlink_callback *); struct netlink_ext_ack *extack; void *data; struct module *module; u32 min_dump_alloc; int flags; }; int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb, const struct nlmsghdr *nlh, struct netlink_dump_control *control); static inline int netlink_dump_start(struct sock *ssk, struct sk_buff *skb, const struct nlmsghdr *nlh, struct netlink_dump_control *control) { if (!control->module) control->module = THIS_MODULE; return __netlink_dump_start(ssk, skb, nlh, control); } struct netlink_tap { struct net_device *dev; struct module *module; struct list_head list; }; int netlink_add_tap(struct netlink_tap *nt); int netlink_remove_tap(struct netlink_tap *nt); bool __netlink_ns_capable(const struct netlink_skb_parms *nsp, struct user_namespace *ns, int cap); bool netlink_ns_capable(const struct sk_buff *skb, struct user_namespace *ns, int cap); bool netlink_capable(const struct sk_buff *skb, int cap); bool netlink_net_capable(const struct sk_buff *skb, int cap); struct sk_buff *netlink_alloc_large_skb(unsigned int size, int broadcast); #endif /* __LINUX_NETLINK_H */ |
| 1232 368 958 482 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 | /* SPDX-License-Identifier: GPL-2.0 */ /* * workqueue.h --- work queue handling for Linux. */ #ifndef _LINUX_WORKQUEUE_H #define _LINUX_WORKQUEUE_H #include <linux/alloc_tag.h> #include <linux/timer.h> #include <linux/linkage.h> #include <linux/bitops.h> #include <linux/lockdep.h> #include <linux/threads.h> #include <linux/atomic.h> #include <linux/cpumask_types.h> #include <linux/rcupdate.h> #include <linux/workqueue_types.h> /* * The first word is the work queue pointer and the flags rolled into * one */ #define work_data_bits(work) ((unsigned long *)(&(work)->data)) enum work_bits { WORK_STRUCT_PENDING_BIT = 0, /* work item is pending execution */ WORK_STRUCT_INACTIVE_BIT, /* work item is inactive */ WORK_STRUCT_PWQ_BIT, /* data points to pwq */ WORK_STRUCT_LINKED_BIT, /* next work is linked to this one */ #ifdef CONFIG_DEBUG_OBJECTS_WORK WORK_STRUCT_STATIC_BIT, /* static initializer (debugobjects) */ #endif WORK_STRUCT_FLAG_BITS, /* color for workqueue flushing */ WORK_STRUCT_COLOR_SHIFT = WORK_STRUCT_FLAG_BITS, WORK_STRUCT_COLOR_BITS = 4, /* * When WORK_STRUCT_PWQ is set, reserve 8 bits off of pwq pointer w/ * debugobjects turned off. This makes pwqs aligned to 256 bytes (512 * bytes w/ DEBUG_OBJECTS_WORK) and allows 16 workqueue flush colors. * * MSB * [ pwq pointer ] [ flush color ] [ STRUCT flags ] * 4 bits 4 or 5 bits */ WORK_STRUCT_PWQ_SHIFT = WORK_STRUCT_COLOR_SHIFT + WORK_STRUCT_COLOR_BITS, /* * data contains off-queue information when !WORK_STRUCT_PWQ. * * MSB * [ pool ID ] [ disable depth ] [ OFFQ flags ] [ STRUCT flags ] * 16 bits 1 bit 4 or 5 bits */ WORK_OFFQ_FLAG_SHIFT = WORK_STRUCT_FLAG_BITS, WORK_OFFQ_BH_BIT = WORK_OFFQ_FLAG_SHIFT, WORK_OFFQ_FLAG_END, WORK_OFFQ_FLAG_BITS = WORK_OFFQ_FLAG_END - WORK_OFFQ_FLAG_SHIFT, WORK_OFFQ_DISABLE_SHIFT = WORK_OFFQ_FLAG_SHIFT + WORK_OFFQ_FLAG_BITS, WORK_OFFQ_DISABLE_BITS = 16, /* * When a work item is off queue, the high bits encode off-queue flags * and the last pool it was on. Cap pool ID to 31 bits and use the * highest number to indicate that no pool is associated. */ WORK_OFFQ_POOL_SHIFT = WORK_OFFQ_DISABLE_SHIFT + WORK_OFFQ_DISABLE_BITS, WORK_OFFQ_LEFT = BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT, WORK_OFFQ_POOL_BITS = WORK_OFFQ_LEFT <= 31 ? WORK_OFFQ_LEFT : 31, }; enum work_flags { WORK_STRUCT_PENDING = 1 << WORK_STRUCT_PENDING_BIT, WORK_STRUCT_INACTIVE = 1 << WORK_STRUCT_INACTIVE_BIT, WORK_STRUCT_PWQ = 1 << WORK_STRUCT_PWQ_BIT, WORK_STRUCT_LINKED = 1 << WORK_STRUCT_LINKED_BIT, #ifdef CONFIG_DEBUG_OBJECTS_WORK WORK_STRUCT_STATIC = 1 << WORK_STRUCT_STATIC_BIT, #else WORK_STRUCT_STATIC = 0, #endif }; enum wq_misc_consts { WORK_NR_COLORS = (1 << WORK_STRUCT_COLOR_BITS), /* not bound to any CPU, prefer the local CPU */ WORK_CPU_UNBOUND = NR_CPUS, /* bit mask for work_busy() return values */ WORK_BUSY_PENDING = 1 << 0, WORK_BUSY_RUNNING = 1 << 1, /* maximum string length for set_worker_desc() */ WORKER_DESC_LEN = 32, }; /* Convenience constants - of type 'unsigned long', not 'enum'! */ #define WORK_OFFQ_BH (1ul << WORK_OFFQ_BH_BIT) #define WORK_OFFQ_FLAG_MASK (((1ul << WORK_OFFQ_FLAG_BITS) - 1) << WORK_OFFQ_FLAG_SHIFT) #define WORK_OFFQ_DISABLE_MASK (((1ul << WORK_OFFQ_DISABLE_BITS) - 1) << WORK_OFFQ_DISABLE_SHIFT) #define WORK_OFFQ_POOL_NONE ((1ul << WORK_OFFQ_POOL_BITS) - 1) #define WORK_STRUCT_NO_POOL (WORK_OFFQ_POOL_NONE << WORK_OFFQ_POOL_SHIFT) #define WORK_STRUCT_PWQ_MASK (~((1ul << WORK_STRUCT_PWQ_SHIFT) - 1)) #define WORK_DATA_INIT() ATOMIC_LONG_INIT((unsigned long)WORK_STRUCT_NO_POOL) #define WORK_DATA_STATIC_INIT() \ ATOMIC_LONG_INIT((unsigned long)(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC)) struct delayed_work { struct work_struct work; struct timer_list timer; /* target workqueue and CPU ->timer uses to queue ->work */ struct workqueue_struct *wq; int cpu; }; struct rcu_work { struct work_struct work; struct rcu_head rcu; /* target workqueue ->rcu uses to queue ->work */ struct workqueue_struct *wq; }; enum wq_affn_scope { WQ_AFFN_DFL, /* use system default */ WQ_AFFN_CPU, /* one pod per CPU */ WQ_AFFN_SMT, /* one pod poer SMT */ WQ_AFFN_CACHE, /* one pod per LLC */ WQ_AFFN_NUMA, /* one pod per NUMA node */ WQ_AFFN_SYSTEM, /* one pod across the whole system */ WQ_AFFN_NR_TYPES, }; /** * struct workqueue_attrs - A struct for workqueue attributes. * * This can be used to change attributes of an unbound workqueue. */ struct workqueue_attrs { /** * @nice: nice level */ int nice; /** * @cpumask: allowed CPUs * * Work items in this workqueue are affine to these CPUs and not allowed * to execute on other CPUs. A pool serving a workqueue must have the * same @cpumask. */ cpumask_var_t cpumask; /** * @__pod_cpumask: internal attribute used to create per-pod pools * * Internal use only. * * Per-pod unbound worker pools are used to improve locality. Always a * subset of ->cpumask. A workqueue can be associated with multiple * worker pools with disjoint @__pod_cpumask's. Whether the enforcement * of a pool's @__pod_cpumask is strict depends on @affn_strict. */ cpumask_var_t __pod_cpumask; /** * @affn_strict: affinity scope is strict * * If clear, workqueue will make a best-effort attempt at starting the * worker inside @__pod_cpumask but the scheduler is free to migrate it * outside. * * If set, workers are only allowed to run inside @__pod_cpumask. */ bool affn_strict; /* * Below fields aren't properties of a worker_pool. They only modify how * :c:func:`apply_workqueue_attrs` select pools and thus don't * participate in pool hash calculations or equality comparisons. * * If @affn_strict is set, @cpumask isn't a property of a worker_pool * either. */ /** * @affn_scope: unbound CPU affinity scope * * CPU pods are used to improve execution locality of unbound work * items. There are multiple pod types, one for each wq_affn_scope, and * every CPU in the system belongs to one pod in every pod type. CPUs * that belong to the same pod share the worker pool. For example, * selecting %WQ_AFFN_NUMA makes the workqueue use a separate worker * pool for each NUMA node. */ enum wq_affn_scope affn_scope; /** * @ordered: work items must be executed one by one in queueing order */ bool ordered; }; static inline struct delayed_work *to_delayed_work(struct work_struct *work) { return container_of(work, struct delayed_work, work); } static inline struct rcu_work *to_rcu_work(struct work_struct *work) { return container_of(work, struct rcu_work, work); } struct execute_work { struct work_struct work; }; #ifdef CONFIG_LOCKDEP /* * NB: because we have to copy the lockdep_map, setting _key * here is required, otherwise it could get initialised to the * copy of the lockdep_map! */ #define __WORK_INIT_LOCKDEP_MAP(n, k) \ .lockdep_map = STATIC_LOCKDEP_MAP_INIT(n, k), #else #define __WORK_INIT_LOCKDEP_MAP(n, k) #endif #define __WORK_INITIALIZER(n, f) { \ .data = WORK_DATA_STATIC_INIT(), \ .entry = { &(n).entry, &(n).entry }, \ .func = (f), \ __WORK_INIT_LOCKDEP_MAP(#n, &(n)) \ } #define __DELAYED_WORK_INITIALIZER(n, f, tflags) { \ .work = __WORK_INITIALIZER((n).work, (f)), \ .timer = __TIMER_INITIALIZER(delayed_work_timer_fn,\ (tflags) | TIMER_IRQSAFE), \ } #define DECLARE_WORK(n, f) \ struct work_struct n = __WORK_INITIALIZER(n, f) #define DECLARE_DELAYED_WORK(n, f) \ struct delayed_work n = __DELAYED_WORK_INITIALIZER(n, f, 0) #define DECLARE_DEFERRABLE_WORK(n, f) \ struct delayed_work n = __DELAYED_WORK_INITIALIZER(n, f, TIMER_DEFERRABLE) #ifdef CONFIG_DEBUG_OBJECTS_WORK extern void __init_work(struct work_struct *work, int onstack); extern void destroy_work_on_stack(struct work_struct *work); extern void destroy_delayed_work_on_stack(struct delayed_work *work); static inline unsigned int work_static(struct work_struct *work) { return *work_data_bits(work) & WORK_STRUCT_STATIC; } #else static inline void __init_work(struct work_struct *work, int onstack) { } static inline void destroy_work_on_stack(struct work_struct *work) { } static inline void destroy_delayed_work_on_stack(struct delayed_work *work) { } static inline unsigned int work_static(struct work_struct *work) { return 0; } #endif /* * initialize all of a work item in one go * * NOTE! No point in using "atomic_long_set()": using a direct * assignment of the work data initializer allows the compiler * to generate better code. */ #ifdef CONFIG_LOCKDEP #define __INIT_WORK_KEY(_work, _func, _onstack, _key) \ do { \ __init_work((_work), _onstack); \ (_work)->data = (atomic_long_t) WORK_DATA_INIT(); \ lockdep_init_map(&(_work)->lockdep_map, "(work_completion)"#_work, (_key), 0); \ INIT_LIST_HEAD(&(_work)->entry); \ (_work)->func = (_func); \ } while (0) #else #define __INIT_WORK_KEY(_work, _func, _onstack, _key) \ do { \ __init_work((_work), _onstack); \ (_work)->data = (atomic_long_t) WORK_DATA_INIT(); \ INIT_LIST_HEAD(&(_work)->entry); \ (_work)->func = (_func); \ } while (0) #endif #define __INIT_WORK(_work, _func, _onstack) \ do { \ static __maybe_unused struct lock_class_key __key; \ \ __INIT_WORK_KEY(_work, _func, _onstack, &__key); \ } while (0) #define INIT_WORK(_work, _func) \ __INIT_WORK((_work), (_func), 0) #define INIT_WORK_ONSTACK(_work, _func) \ __INIT_WORK((_work), (_func), 1) #define INIT_WORK_ONSTACK_KEY(_work, _func, _key) \ __INIT_WORK_KEY((_work), (_func), 1, _key) #define __INIT_DELAYED_WORK(_work, _func, _tflags) \ do { \ INIT_WORK(&(_work)->work, (_func)); \ __timer_init(&(_work)->timer, \ delayed_work_timer_fn, \ (_tflags) | TIMER_IRQSAFE); \ } while (0) #define __INIT_DELAYED_WORK_ONSTACK(_work, _func, _tflags) \ do { \ INIT_WORK_ONSTACK(&(_work)->work, (_func)); \ __timer_init_on_stack(&(_work)->timer, \ delayed_work_timer_fn, \ (_tflags) | TIMER_IRQSAFE); \ } while (0) #define INIT_DELAYED_WORK(_work, _func) \ __INIT_DELAYED_WORK(_work, _func, 0) #define INIT_DELAYED_WORK_ONSTACK(_work, _func) \ __INIT_DELAYED_WORK_ONSTACK(_work, _func, 0) #define INIT_DEFERRABLE_WORK(_work, _func) \ __INIT_DELAYED_WORK(_work, _func, TIMER_DEFERRABLE) #define INIT_DEFERRABLE_WORK_ONSTACK(_work, _func) \ __INIT_DELAYED_WORK_ONSTACK(_work, _func, TIMER_DEFERRABLE) #define INIT_RCU_WORK(_work, _func) \ INIT_WORK(&(_work)->work, (_func)) #define INIT_RCU_WORK_ONSTACK(_work, _func) \ INIT_WORK_ONSTACK(&(_work)->work, (_func)) /** * work_pending - Find out whether a work item is currently pending * @work: The work item in question */ #define work_pending(work) \ test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) /** * delayed_work_pending - Find out whether a delayable work item is currently * pending * @w: The work item in question */ #define delayed_work_pending(w) \ work_pending(&(w)->work) /* * Workqueue flags and constants. For details, please refer to * Documentation/core-api/workqueue.rst. */ enum wq_flags { WQ_BH = 1 << 0, /* execute in bottom half (softirq) context */ WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ WQ_FREEZABLE = 1 << 2, /* freeze during suspend */ WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */ WQ_HIGHPRI = 1 << 4, /* high priority */ WQ_CPU_INTENSIVE = 1 << 5, /* cpu intensive workqueue */ WQ_SYSFS = 1 << 6, /* visible in sysfs, see workqueue_sysfs_register() */ /* * Per-cpu workqueues are generally preferred because they tend to * show better performance thanks to cache locality. Per-cpu * workqueues exclude the scheduler from choosing the CPU to * execute the worker threads, which has an unfortunate side effect * of increasing power consumption. * * The scheduler considers a CPU idle if it doesn't have any task * to execute and tries to keep idle cores idle to conserve power; * however, for example, a per-cpu work item scheduled from an * interrupt handler on an idle CPU will force the scheduler to * execute the work item on that CPU breaking the idleness, which in * turn may lead to more scheduling choices which are sub-optimal * in terms of power consumption. * * Workqueues marked with WQ_POWER_EFFICIENT are per-cpu by default * but become unbound if workqueue.power_efficient kernel param is * specified. Per-cpu workqueues which are identified to * contribute significantly to power-consumption are identified and * marked with this flag and enabling the power_efficient mode * leads to noticeable power saving at the cost of small * performance disadvantage. * * http://thread.gmane.org/gmane.linux.kernel/1480396 */ WQ_POWER_EFFICIENT = 1 << 7, WQ_PERCPU = 1 << 8, /* bound to a specific cpu */ __WQ_DESTROYING = 1 << 15, /* internal: workqueue is destroying */ __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */ __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */ __WQ_LEGACY = 1 << 18, /* internal: create*_workqueue() */ /* BH wq only allows the following flags */ __WQ_BH_ALLOWS = WQ_BH | WQ_HIGHPRI, }; enum wq_consts { WQ_MAX_ACTIVE = 2048, /* I like 2048, better ideas? */ WQ_UNBOUND_MAX_ACTIVE = WQ_MAX_ACTIVE, WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2, /* * Per-node default cap on min_active. Unless explicitly set, min_active * is set to min(max_active, WQ_DFL_MIN_ACTIVE). For more details, see * workqueue_struct->min_active definition. */ WQ_DFL_MIN_ACTIVE = 8, }; /* * System-wide workqueues which are always present. * * system_percpu_wq is the one used by schedule[_delayed]_work[_on](). * Multi-CPU multi-threaded. There are users which expect relatively * short queue flush time. Don't queue works which can run for too * long. * * system_highpri_wq is similar to system_wq but for work items which * require WQ_HIGHPRI. * * system_long_wq is similar to system_wq but may host long running * works. Queue flushing might take relatively long. * * system_dfl_wq is unbound workqueue. Workers are not bound to * any specific CPU, not concurrency managed, and all queued works are * executed immediately as long as max_active limit is not reached and * resources are available. * * system_freezable_wq is equivalent to system_wq except that it's * freezable. * * *_power_efficient_wq are inclined towards saving power and converted * into WQ_UNBOUND variants if 'wq_power_efficient' is enabled; otherwise, * they are same as their non-power-efficient counterparts - e.g. * system_power_efficient_wq is identical to system_wq if * 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info. * * system_bh[_highpri]_wq are convenience interface to softirq. BH work items * are executed in the queueing CPU's BH context in the queueing order. */ extern struct workqueue_struct *system_wq; /* use system_percpu_wq, this will be removed */ extern struct workqueue_struct *system_percpu_wq; extern struct workqueue_struct *system_highpri_wq; extern struct workqueue_struct *system_long_wq; extern struct workqueue_struct *system_unbound_wq; extern struct workqueue_struct *system_dfl_wq; extern struct workqueue_struct *system_freezable_wq; extern struct workqueue_struct *system_power_efficient_wq; extern struct workqueue_struct *system_freezable_power_efficient_wq; extern struct workqueue_struct *system_bh_wq; extern struct workqueue_struct *system_bh_highpri_wq; void workqueue_softirq_action(bool highpri); void workqueue_softirq_dead(unsigned int cpu); /** * alloc_workqueue - allocate a workqueue * @fmt: printf format for the name of the workqueue * @flags: WQ_* flags * @max_active: max in-flight work items, 0 for default * @...: args for @fmt * * For a per-cpu workqueue, @max_active limits the number of in-flight work * items for each CPU. e.g. @max_active of 1 indicates that each CPU can be * executing at most one work item for the workqueue. * * For unbound workqueues, @max_active limits the number of in-flight work items * for the whole system. e.g. @max_active of 16 indicates that there can be * at most 16 work items executing for the workqueue in the whole system. * * As sharing the same active counter for an unbound workqueue across multiple * NUMA nodes can be expensive, @max_active is distributed to each NUMA node * according to the proportion of the number of online CPUs and enforced * independently. * * Depending on online CPU distribution, a node may end up with per-node * max_active which is significantly lower than @max_active, which can lead to * deadlocks if the per-node concurrency limit is lower than the maximum number * of interdependent work items for the workqueue. * * To guarantee forward progress regardless of online CPU distribution, the * concurrency limit on every node is guaranteed to be equal to or greater than * min_active which is set to min(@max_active, %WQ_DFL_MIN_ACTIVE). This means * that the sum of per-node max_active's may be larger than @max_active. * * For detailed information on %WQ_* flags, please refer to * Documentation/core-api/workqueue.rst. * * RETURNS: * Pointer to the allocated workqueue on success, %NULL on failure. */ __printf(1, 4) struct workqueue_struct * alloc_workqueue_noprof(const char *fmt, unsigned int flags, int max_active, ...); #define alloc_workqueue(...) alloc_hooks(alloc_workqueue_noprof(__VA_ARGS__)) #ifdef CONFIG_LOCKDEP /** * alloc_workqueue_lockdep_map - allocate a workqueue with user-defined lockdep_map * @fmt: printf format for the name of the workqueue * @flags: WQ_* flags * @max_active: max in-flight work items, 0 for default * @lockdep_map: user-defined lockdep_map * @...: args for @fmt * * Same as alloc_workqueue but with the a user-define lockdep_map. Useful for * workqueues created with the same purpose and to avoid leaking a lockdep_map * on each workqueue creation. * * RETURNS: * Pointer to the allocated workqueue on success, %NULL on failure. */ __printf(1, 5) struct workqueue_struct * alloc_workqueue_lockdep_map(const char *fmt, unsigned int flags, int max_active, struct lockdep_map *lockdep_map, ...); /** * alloc_ordered_workqueue_lockdep_map - allocate an ordered workqueue with * user-defined lockdep_map * * @fmt: printf format for the name of the workqueue * @flags: WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful) * @lockdep_map: user-defined lockdep_map * @args: args for @fmt * * Same as alloc_ordered_workqueue but with the a user-define lockdep_map. * Useful for workqueues created with the same purpose and to avoid leaking a * lockdep_map on each workqueue creation. * * RETURNS: * Pointer to the allocated workqueue on success, %NULL on failure. */ #define alloc_ordered_workqueue_lockdep_map(fmt, flags, lockdep_map, args...) \ alloc_hooks(alloc_workqueue_lockdep_map(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags),\ 1, lockdep_map, ##args)) #endif /** * alloc_ordered_workqueue - allocate an ordered workqueue * @fmt: printf format for the name of the workqueue * @flags: WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful) * @args: args for @fmt * * Allocate an ordered workqueue. An ordered workqueue executes at * most one work item at any given time in the queued order. They are * implemented as unbound workqueues with @max_active of one. * * RETURNS: * Pointer to the allocated workqueue on success, %NULL on failure. */ #define alloc_ordered_workqueue(fmt, flags, args...) \ alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args) #define create_workqueue(name) \ alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name)) #define create_freezable_workqueue(name) \ alloc_workqueue("%s", __WQ_LEGACY | WQ_FREEZABLE | WQ_UNBOUND | \ WQ_MEM_RECLAIM, 1, (name)) #define create_singlethread_workqueue(name) \ alloc_ordered_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, name) #define from_work(var, callback_work, work_fieldname) \ container_of(callback_work, typeof(*var), work_fieldname) extern void destroy_workqueue(struct workqueue_struct *wq); struct workqueue_attrs *alloc_workqueue_attrs_noprof(void); #define alloc_workqueue_attrs(...) alloc_hooks(alloc_workqueue_attrs_noprof(__VA_ARGS__)) void free_workqueue_attrs(struct workqueue_attrs *attrs); int apply_workqueue_attrs(struct workqueue_struct *wq, const struct workqueue_attrs *attrs); extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask); extern bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work); extern bool queue_work_node(int node, struct workqueue_struct *wq, struct work_struct *work); extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *work, unsigned long delay); extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay); extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork); extern void __flush_workqueue(struct workqueue_struct *wq); extern void drain_workqueue(struct workqueue_struct *wq); extern int schedule_on_each_cpu(work_func_t func); int execute_in_process_context(work_func_t fn, struct execute_work *); extern bool flush_work(struct work_struct *work); extern bool cancel_work(struct work_struct *work); extern bool cancel_work_sync(struct work_struct *work); extern bool flush_delayed_work(struct delayed_work *dwork); extern bool cancel_delayed_work(struct delayed_work *dwork); extern bool cancel_delayed_work_sync(struct delayed_work *dwork); extern bool disable_work(struct work_struct *work); extern bool disable_work_sync(struct work_struct *work); extern bool enable_work(struct work_struct *work); extern bool disable_delayed_work(struct delayed_work *dwork); extern bool disable_delayed_work_sync(struct delayed_work *dwork); extern bool enable_delayed_work(struct delayed_work *dwork); extern bool flush_rcu_work(struct rcu_work *rwork); extern void workqueue_set_max_active(struct workqueue_struct *wq, int max_active); extern void workqueue_set_min_active(struct workqueue_struct *wq, int min_active); extern struct work_struct *current_work(void); extern bool current_is_workqueue_rescuer(void); extern bool workqueue_congested(int cpu, struct workqueue_struct *wq); extern unsigned int work_busy(struct work_struct *work); extern __printf(1, 2) void set_worker_desc(const char *fmt, ...); extern void print_worker_info(const char *log_lvl, struct task_struct *task); extern void show_all_workqueues(void); extern void show_freezable_workqueues(void); extern void show_one_workqueue(struct workqueue_struct *wq); extern void wq_worker_comm(char *buf, size_t size, struct task_struct *task); /** * queue_work - queue work on a workqueue * @wq: workqueue to use * @work: work to queue * * Returns %false if @work was already on a queue, %true otherwise. * * We queue the work to the CPU on which it was submitted, but if the CPU dies * it can be processed by another CPU. * * Memory-ordering properties: If it returns %true, guarantees that all stores * preceding the call to queue_work() in the program order will be visible from * the CPU which will execute @work by the time such work executes, e.g., * * { x is initially 0 } * * CPU0 CPU1 * * WRITE_ONCE(x, 1); [ @work is being executed ] * r0 = queue_work(wq, work); r1 = READ_ONCE(x); * * Forbids: r0 == true && r1 == 0 */ static inline bool queue_work(struct workqueue_struct *wq, struct work_struct *work) { return queue_work_on(WORK_CPU_UNBOUND, wq, work); } /** * queue_delayed_work - queue work on a workqueue after delay * @wq: workqueue to use * @dwork: delayable work to queue * @delay: number of jiffies to wait before queueing * * Equivalent to queue_delayed_work_on() but tries to use the local CPU. */ static inline bool queue_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay) { return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay); } /** * mod_delayed_work - modify delay of or queue a delayed work * @wq: workqueue to use * @dwork: work to queue * @delay: number of jiffies to wait before queueing * * mod_delayed_work_on() on local CPU. */ static inline bool mod_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay) { return mod_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay); } /** * schedule_work_on - put work task on a specific cpu * @cpu: cpu to put the work task on * @work: job to be done * * This puts a job on a specific cpu */ static inline bool schedule_work_on(int cpu, struct work_struct *work) { return queue_work_on(cpu, system_wq, work); } /** * schedule_work - put work task in global workqueue * @work: job to be done * * Returns %false if @work was already on the kernel-global workqueue and * %true otherwise. * * This puts a job in the kernel-global workqueue if it was not already * queued and leaves it in the same position on the kernel-global * workqueue otherwise. * * Shares the same memory-ordering properties of queue_work(), cf. the * DocBook header of queue_work(). */ static inline bool schedule_work(struct work_struct *work) { return queue_work(system_wq, work); } /** * enable_and_queue_work - Enable and queue a work item on a specific workqueue * @wq: The target workqueue * @work: The work item to be enabled and queued * * This function combines the operations of enable_work() and queue_work(), * providing a convenient way to enable and queue a work item in a single call. * It invokes enable_work() on @work and then queues it if the disable depth * reached 0. Returns %true if the disable depth reached 0 and @work is queued, * and %false otherwise. * * Note that @work is always queued when disable depth reaches zero. If the * desired behavior is queueing only if certain events took place while @work is * disabled, the user should implement the necessary state tracking and perform * explicit conditional queueing after enable_work(). */ static inline bool enable_and_queue_work(struct workqueue_struct *wq, struct work_struct *work) { if (enable_work(work)) { queue_work(wq, work); return true; } return false; } /* * Detect attempt to flush system-wide workqueues at compile time when possible. * Warn attempt to flush system-wide workqueues at runtime. * * See https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp * for reasons and steps for converting system-wide workqueues into local workqueues. */ extern void __warn_flushing_systemwide_wq(void) __compiletime_warning("Please avoid flushing system-wide workqueues."); /* Please stop using this function, for this function will be removed in near future. */ #define flush_scheduled_work() \ ({ \ __warn_flushing_systemwide_wq(); \ __flush_workqueue(system_wq); \ }) #define flush_workqueue(wq) \ ({ \ struct workqueue_struct *_wq = (wq); \ \ if ((__builtin_constant_p(_wq == system_wq) && \ _wq == system_wq) || \ (__builtin_constant_p(_wq == system_highpri_wq) && \ _wq == system_highpri_wq) || \ (__builtin_constant_p(_wq == system_long_wq) && \ _wq == system_long_wq) || \ (__builtin_constant_p(_wq == system_unbound_wq) && \ _wq == system_unbound_wq) || \ (__builtin_constant_p(_wq == system_freezable_wq) && \ _wq == system_freezable_wq) || \ (__builtin_constant_p(_wq == system_power_efficient_wq) && \ _wq == system_power_efficient_wq) || \ (__builtin_constant_p(_wq == system_freezable_power_efficient_wq) && \ _wq == system_freezable_power_efficient_wq)) \ __warn_flushing_systemwide_wq(); \ __flush_workqueue(_wq); \ }) /** * schedule_delayed_work_on - queue work in global workqueue on CPU after delay * @cpu: cpu to use * @dwork: job to be done * @delay: number of jiffies to wait * * After waiting for a given time this puts a job in the kernel-global * workqueue on the specified CPU. */ static inline bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay) { return queue_delayed_work_on(cpu, system_wq, dwork, delay); } /** * schedule_delayed_work - put work task in global workqueue after delay * @dwork: job to be done * @delay: number of jiffies to wait or 0 for immediate execution * * After waiting for a given time this puts a job in the kernel-global * workqueue. */ static inline bool schedule_delayed_work(struct delayed_work *dwork, unsigned long delay) { return queue_delayed_work(system_wq, dwork, delay); } #ifndef CONFIG_SMP static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg) { return fn(arg); } static inline long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg) { return fn(arg); } #else long work_on_cpu_key(int cpu, long (*fn)(void *), void *arg, struct lock_class_key *key); /* * A new key is defined for each caller to make sure the work * associated with the function doesn't share its locking class. */ #define work_on_cpu(_cpu, _fn, _arg) \ ({ \ static struct lock_class_key __key; \ \ work_on_cpu_key(_cpu, _fn, _arg, &__key); \ }) #endif /* CONFIG_SMP */ #ifdef CONFIG_FREEZER extern void freeze_workqueues_begin(void); extern bool freeze_workqueues_busy(void); extern void thaw_workqueues(void); #endif /* CONFIG_FREEZER */ #ifdef CONFIG_SYSFS int workqueue_sysfs_register(struct workqueue_struct *wq); #else /* CONFIG_SYSFS */ static inline int workqueue_sysfs_register(struct workqueue_struct *wq) { return 0; } #endif /* CONFIG_SYSFS */ #ifdef CONFIG_WQ_WATCHDOG void wq_watchdog_touch(int cpu); #else /* CONFIG_WQ_WATCHDOG */ static inline void wq_watchdog_touch(int cpu) { } #endif /* CONFIG_WQ_WATCHDOG */ #ifdef CONFIG_SMP int workqueue_prepare_cpu(unsigned int cpu); int workqueue_online_cpu(unsigned int cpu); int workqueue_offline_cpu(unsigned int cpu); #endif void __init workqueue_init_early(void); void __init workqueue_init(void); void __init workqueue_init_topology(void); #endif |
| 450 9 39 296 146 363 320 230 449 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __NET_GENERIC_NETLINK_H #define __NET_GENERIC_NETLINK_H #include <linux/net.h> #include <net/netlink.h> #include <net/net_namespace.h> #include <uapi/linux/genetlink.h> #define GENLMSG_DEFAULT_SIZE (NLMSG_DEFAULT_SIZE - GENL_HDRLEN) /* Non-parallel generic netlink requests are serialized by a global lock. */ void genl_lock(void); void genl_unlock(void); #define MODULE_ALIAS_GENL_FAMILY(family) \ MODULE_ALIAS_NET_PF_PROTO_NAME(PF_NETLINK, NETLINK_GENERIC, "-family-" family) /* Binding to multicast group requires %CAP_NET_ADMIN */ #define GENL_MCAST_CAP_NET_ADMIN BIT(0) /* Binding to multicast group requires %CAP_SYS_ADMIN */ #define GENL_MCAST_CAP_SYS_ADMIN BIT(1) /** * struct genl_multicast_group - generic netlink multicast group * @name: name of the multicast group, names are per-family * @flags: GENL_MCAST_* flags */ struct genl_multicast_group { char name[GENL_NAMSIZ]; u8 flags; }; struct genl_split_ops; struct genl_info; /** * struct genl_family - generic netlink family * @hdrsize: length of user specific header in bytes * @name: name of family * @version: protocol version * @maxattr: maximum number of attributes supported * @policy: netlink policy * @netnsok: set to true if the family can handle network * namespaces and should be presented in all of them * @parallel_ops: operations can be called in parallel and aren't * synchronized by the core genetlink code * @pre_doit: called before an operation's doit callback, it may * do additional, common, filtering and return an error * @post_doit: called after an operation's doit callback, it may * undo operations done by pre_doit, for example release locks * @bind: called when family multicast group is added to a netlink socket * @unbind: called when family multicast group is removed from a netlink socket * @module: pointer to the owning module (set to THIS_MODULE) * @mcgrps: multicast groups used by this family * @n_mcgrps: number of multicast groups * @resv_start_op: first operation for which reserved fields of the header * can be validated and policies are required (see below); * new families should leave this field at zero * @ops: the operations supported by this family * @n_ops: number of operations supported by this family * @small_ops: the small-struct operations supported by this family * @n_small_ops: number of small-struct operations supported by this family * @split_ops: the split do/dump form of operation definition * @n_split_ops: number of entries in @split_ops, not that with split do/dump * ops the number of entries is not the same as number of commands * @sock_priv_size: the size of per-socket private memory * @sock_priv_init: the per-socket private memory initializer * @sock_priv_destroy: the per-socket private memory destructor * * Attribute policies (the combination of @policy and @maxattr fields) * can be attached at the family level or at the operation level. * If both are present the per-operation policy takes precedence. * For operations before @resv_start_op lack of policy means that the core * will perform no attribute parsing or validation. For newer operations * if policy is not provided core will reject all TLV attributes. */ struct genl_family { unsigned int hdrsize; char name[GENL_NAMSIZ]; unsigned int version; unsigned int maxattr; u8 netnsok:1; u8 parallel_ops:1; u8 n_ops; u8 n_small_ops; u8 n_split_ops; u8 n_mcgrps; u8 resv_start_op; const struct nla_policy *policy; int (*pre_doit)(const struct genl_split_ops *ops, struct sk_buff *skb, struct genl_info *info); void (*post_doit)(const struct genl_split_ops *ops, struct sk_buff *skb, struct genl_info *info); int (*bind)(int mcgrp); void (*unbind)(int mcgrp); const struct genl_ops * ops; const struct genl_small_ops *small_ops; const struct genl_split_ops *split_ops; const struct genl_multicast_group *mcgrps; struct module *module; size_t sock_priv_size; void (*sock_priv_init)(void *priv); void (*sock_priv_destroy)(void *priv); /* private: internal use only */ /* protocol family identifier */ int id; /* starting number of multicast group IDs in this family */ unsigned int mcgrp_offset; /* list of per-socket privs */ struct xarray *sock_privs; }; /** * struct genl_info - receiving information * @snd_seq: sending sequence number * @snd_portid: netlink portid of sender * @family: generic netlink family * @nlhdr: netlink message header * @genlhdr: generic netlink message header * @attrs: netlink attributes * @_net: network namespace * @ctx: storage space for the use by the family * @user_ptr: user pointers (deprecated, use ctx instead) * @extack: extended ACK report struct */ struct genl_info { u32 snd_seq; u32 snd_portid; const struct genl_family *family; const struct nlmsghdr * nlhdr; struct genlmsghdr * genlhdr; struct nlattr ** attrs; possible_net_t _net; union { u8 ctx[NETLINK_CTX_SIZE]; void * user_ptr[2]; }; struct netlink_ext_ack *extack; }; static inline struct net *genl_info_net(const struct genl_info *info) { return read_pnet(&info->_net); } static inline void genl_info_net_set(struct genl_info *info, struct net *net) { write_pnet(&info->_net, net); } static inline void *genl_info_userhdr(const struct genl_info *info) { return (u8 *)info->genlhdr + GENL_HDRLEN; } #define GENL_SET_ERR_MSG(info, msg) NL_SET_ERR_MSG((info)->extack, msg) #define GENL_SET_ERR_MSG_FMT(info, msg, args...) \ NL_SET_ERR_MSG_FMT((info)->extack, msg, ##args) /* Report that a root attribute is missing */ #define GENL_REQ_ATTR_CHECK(info, attr) ({ \ const struct genl_info *__info = (info); \ \ NL_REQ_ATTR_CHECK(__info->extack, NULL, __info->attrs, (attr)); \ }) enum genl_validate_flags { GENL_DONT_VALIDATE_STRICT = BIT(0), GENL_DONT_VALIDATE_DUMP = BIT(1), GENL_DONT_VALIDATE_DUMP_STRICT = BIT(2), }; /** * struct genl_small_ops - generic netlink operations (small version) * @cmd: command identifier * @internal_flags: flags used by the family * @flags: GENL_* flags (%GENL_ADMIN_PERM or %GENL_UNS_ADMIN_PERM) * @validate: validation flags from enum genl_validate_flags * @doit: standard command callback * @dumpit: callback for dumpers * * This is a cut-down version of struct genl_ops for users who don't need * most of the ancillary infra and want to save space. */ struct genl_small_ops { int (*doit)(struct sk_buff *skb, struct genl_info *info); int (*dumpit)(struct sk_buff *skb, struct netlink_callback *cb); u8 cmd; u8 internal_flags; u8 flags; u8 validate; }; /** * struct genl_ops - generic netlink operations * @cmd: command identifier * @internal_flags: flags used by the family * @flags: GENL_* flags (%GENL_ADMIN_PERM or %GENL_UNS_ADMIN_PERM) * @maxattr: maximum number of attributes supported * @policy: netlink policy (takes precedence over family policy) * @validate: validation flags from enum genl_validate_flags * @doit: standard command callback * @start: start callback for dumps * @dumpit: callback for dumpers * @done: completion callback for dumps */ struct genl_ops { int (*doit)(struct sk_buff *skb, struct genl_info *info); int (*start)(struct netlink_callback *cb); int (*dumpit)(struct sk_buff *skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); const struct nla_policy *policy; unsigned int maxattr; u8 cmd; u8 internal_flags; u8 flags; u8 validate; }; /** * struct genl_split_ops - generic netlink operations (do/dump split version) * @cmd: command identifier * @internal_flags: flags used by the family * @flags: GENL_* flags (%GENL_ADMIN_PERM or %GENL_UNS_ADMIN_PERM) * @validate: validation flags from enum genl_validate_flags * @policy: netlink policy (takes precedence over family policy) * @maxattr: maximum number of attributes supported * * Do callbacks: * @pre_doit: called before an operation's @doit callback, it may * do additional, common, filtering and return an error * @doit: standard command callback * @post_doit: called after an operation's @doit callback, it may * undo operations done by pre_doit, for example release locks * * Dump callbacks: * @start: start callback for dumps * @dumpit: callback for dumpers * @done: completion callback for dumps * * Do callbacks can be used if %GENL_CMD_CAP_DO is set in @flags. * Dump callbacks can be used if %GENL_CMD_CAP_DUMP is set in @flags. * Exactly one of those flags must be set. */ struct genl_split_ops { union { struct { int (*pre_doit)(const struct genl_split_ops *ops, struct sk_buff *skb, struct genl_info *info); int (*doit)(struct sk_buff *skb, struct genl_info *info); void (*post_doit)(const struct genl_split_ops *ops, struct sk_buff *skb, struct genl_info *info); }; struct { int (*start)(struct netlink_callback *cb); int (*dumpit)(struct sk_buff *skb, struct netlink_callback *cb); int (*done)(struct netlink_callback *cb); }; }; const struct nla_policy *policy; unsigned int maxattr; u8 cmd; u8 internal_flags; u8 flags; u8 validate; }; /** * struct genl_dumpit_info - info that is available during dumpit op call * @op: generic netlink ops - for internal genl code usage * @attrs: netlink attributes * @info: struct genl_info describing the request */ struct genl_dumpit_info { struct genl_split_ops op; struct genl_info info; }; static inline const struct genl_dumpit_info * genl_dumpit_info(struct netlink_callback *cb) { return cb->data; } static inline const struct genl_info * genl_info_dump(struct netlink_callback *cb) { return &genl_dumpit_info(cb)->info; } /** * genl_info_init_ntf() - initialize genl_info for notifications * @info: genl_info struct to set up * @family: pointer to the genetlink family * @cmd: command to be used in the notification * * Initialize a locally declared struct genl_info to pass to various APIs. * Intended to be used when creating notifications. */ static inline void genl_info_init_ntf(struct genl_info *info, const struct genl_family *family, u8 cmd) { struct genlmsghdr *hdr = (void *) &info->user_ptr[0]; memset(info, 0, sizeof(*info)); info->family = family; info->genlhdr = hdr; hdr->cmd = cmd; } static inline bool genl_info_is_ntf(const struct genl_info *info) { return !info->nlhdr; } void *__genl_sk_priv_get(struct genl_family *family, struct sock *sk); void *genl_sk_priv_get(struct genl_family *family, struct sock *sk); int genl_register_family(struct genl_family *family); int genl_unregister_family(const struct genl_family *family); void genl_notify(const struct genl_family *family, struct sk_buff *skb, struct genl_info *info, u32 group, gfp_t flags); void *genlmsg_put(struct sk_buff *skb, u32 portid, u32 seq, const struct genl_family *family, int flags, u8 cmd); static inline void * __genlmsg_iput(struct sk_buff *skb, const struct genl_info *info, int flags) { return genlmsg_put(skb, info->snd_portid, info->snd_seq, info->family, flags, info->genlhdr->cmd); } /** * genlmsg_iput - start genetlink message based on genl_info * @skb: skb in which message header will be placed * @info: genl_info as provided to do/dump handlers * * Convenience wrapper which starts a genetlink message based on * information in user request. @info should be either the struct passed * by genetlink core to do/dump handlers (when constructing replies to * such requests) or a struct initialized by genl_info_init_ntf() * when constructing notifications. * * Returns: pointer to new genetlink header. */ static inline void * genlmsg_iput(struct sk_buff *skb, const struct genl_info *info) { return __genlmsg_iput(skb, info, 0); } /** * genlmsg_nlhdr - Obtain netlink header from user specified header * @user_hdr: user header as returned from genlmsg_put() * * Returns: pointer to netlink header. */ static inline struct nlmsghdr *genlmsg_nlhdr(void *user_hdr) { return (struct nlmsghdr *)((char *)user_hdr - GENL_HDRLEN - NLMSG_HDRLEN); } /** * genlmsg_parse_deprecated - parse attributes of a genetlink message * @nlh: netlink message header * @family: genetlink message family * @tb: destination array with maxtype+1 elements * @maxtype: maximum attribute type to be expected * @policy: validation policy * @extack: extended ACK report struct */ static inline int genlmsg_parse_deprecated(const struct nlmsghdr *nlh, const struct genl_family *family, struct nlattr *tb[], int maxtype, const struct nla_policy *policy, struct netlink_ext_ack *extack) { return __nlmsg_parse(nlh, family->hdrsize + GENL_HDRLEN, tb, maxtype, policy, NL_VALIDATE_LIBERAL, extack); } /** * genlmsg_parse - parse attributes of a genetlink message * @nlh: netlink message header * @family: genetlink message family * @tb: destination array with maxtype+1 elements * @maxtype: maximum attribute type to be expected * @policy: validation policy * @extack: extended ACK report struct */ static inline int genlmsg_parse(const struct nlmsghdr *nlh, const struct genl_family *family, struct nlattr *tb[], int maxtype, const struct nla_policy *policy, struct netlink_ext_ack *extack) { return __nlmsg_parse(nlh, family->hdrsize + GENL_HDRLEN, tb, maxtype, policy, NL_VALIDATE_STRICT, extack); } /** * genl_dump_check_consistent - check if sequence is consistent and advertise if not * @cb: netlink callback structure that stores the sequence number * @user_hdr: user header as returned from genlmsg_put() * * Cf. nl_dump_check_consistent(), this just provides a wrapper to make it * simpler to use with generic netlink. */ static inline void genl_dump_check_consistent(struct netlink_callback *cb, void *user_hdr) { nl_dump_check_consistent(cb, genlmsg_nlhdr(user_hdr)); } /** * genlmsg_put_reply - Add generic netlink header to a reply message * @skb: socket buffer holding the message * @info: receiver info * @family: generic netlink family * @flags: netlink message flags * @cmd: generic netlink command * * Returns: pointer to user specific header */ static inline void *genlmsg_put_reply(struct sk_buff *skb, struct genl_info *info, const struct genl_family *family, int flags, u8 cmd) { return genlmsg_put(skb, info->snd_portid, info->snd_seq, family, flags, cmd); } /** * genlmsg_end - Finalize a generic netlink message * @skb: socket buffer the message is stored in * @hdr: user specific header */ static inline void genlmsg_end(struct sk_buff *skb, void *hdr) { nlmsg_end(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN); } /** * genlmsg_cancel - Cancel construction of a generic netlink message * @skb: socket buffer the message is stored in * @hdr: generic netlink message header */ static inline void genlmsg_cancel(struct sk_buff *skb, void *hdr) { if (hdr) nlmsg_cancel(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN); } /** * genlmsg_multicast_netns_filtered - multicast a netlink message * to a specific netns with filter * function * @family: the generic netlink family * @net: the net namespace * @skb: netlink message as socket buffer * @portid: own netlink portid to avoid sending to yourself * @group: offset of multicast group in groups array * @flags: allocation flags * @filter: filter function * @filter_data: filter function private data * * Return: 0 on success, negative error code for failure. */ static inline int genlmsg_multicast_netns_filtered(const struct genl_family *family, struct net *net, struct sk_buff *skb, u32 portid, unsigned int group, gfp_t flags, netlink_filter_fn filter, void *filter_data) { if (WARN_ON_ONCE(group >= family->n_mcgrps)) return -EINVAL; group = family->mcgrp_offset + group; return nlmsg_multicast_filtered(net->genl_sock, skb, portid, group, flags, filter, filter_data); } /** * genlmsg_multicast_netns - multicast a netlink message to a specific netns * @family: the generic netlink family * @net: the net namespace * @skb: netlink message as socket buffer * @portid: own netlink portid to avoid sending to yourself * @group: offset of multicast group in groups array * @flags: allocation flags */ static inline int genlmsg_multicast_netns(const struct genl_family *family, struct net *net, struct sk_buff *skb, u32 portid, unsigned int group, gfp_t flags) { return genlmsg_multicast_netns_filtered(family, net, skb, portid, group, flags, NULL, NULL); } /** * genlmsg_multicast - multicast a netlink message to the default netns * @family: the generic netlink family * @skb: netlink message as socket buffer * @portid: own netlink portid to avoid sending to yourself * @group: offset of multicast group in groups array * @flags: allocation flags */ static inline int genlmsg_multicast(const struct genl_family *family, struct sk_buff *skb, u32 portid, unsigned int group, gfp_t flags) { return genlmsg_multicast_netns(family, &init_net, skb, portid, group, flags); } /** * genlmsg_multicast_allns - multicast a netlink message to all net namespaces * @family: the generic netlink family * @skb: netlink message as socket buffer * @portid: own netlink portid to avoid sending to yourself * @group: offset of multicast group in groups array * * This function must hold the RTNL or rcu_read_lock(). */ int genlmsg_multicast_allns(const struct genl_family *family, struct sk_buff *skb, u32 portid, unsigned int group); /** * genlmsg_unicast - unicast a netlink message * @net: network namespace to look up @portid in * @skb: netlink message as socket buffer * @portid: netlink portid of the destination socket */ static inline int genlmsg_unicast(struct net *net, struct sk_buff *skb, u32 portid) { return nlmsg_unicast(net->genl_sock, skb, portid); } /** * genlmsg_reply - reply to a request * @skb: netlink message to be sent back * @info: receiver information */ static inline int genlmsg_reply(struct sk_buff *skb, struct genl_info *info) { return genlmsg_unicast(genl_info_net(info), skb, info->snd_portid); } /** * genlmsg_data - head of message payload * @gnlh: genetlink message header */ static inline void *genlmsg_data(const struct genlmsghdr *gnlh) { return ((unsigned char *) gnlh + GENL_HDRLEN); } /** * genlmsg_len - length of message payload * @gnlh: genetlink message header */ static inline int genlmsg_len(const struct genlmsghdr *gnlh) { struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh - NLMSG_HDRLEN); return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN); } /** * genlmsg_msg_size - length of genetlink message not including padding * @payload: length of message payload */ static inline int genlmsg_msg_size(int payload) { return GENL_HDRLEN + payload; } /** * genlmsg_total_size - length of genetlink message including padding * @payload: length of message payload */ static inline int genlmsg_total_size(int payload) { return NLMSG_ALIGN(genlmsg_msg_size(payload)); } /** * genlmsg_new - Allocate a new generic netlink message * @payload: size of the message payload * @flags: the type of memory to allocate. */ static inline struct sk_buff *genlmsg_new(size_t payload, gfp_t flags) { return nlmsg_new(genlmsg_total_size(payload), flags); } /** * genl_set_err - report error to genetlink broadcast listeners * @family: the generic netlink family * @net: the network namespace to report the error to * @portid: the PORTID of a process that we want to skip (if any) * @group: the broadcast group that will notice the error * (this is the offset of the multicast group in the groups array) * @code: error code, must be negative (as usual in kernelspace) * * This function returns the number of broadcast listeners that have set the * NETLINK_RECV_NO_ENOBUFS socket option. */ static inline int genl_set_err(const struct genl_family *family, struct net *net, u32 portid, u32 group, int code) { if (WARN_ON_ONCE(group >= family->n_mcgrps)) return -EINVAL; group = family->mcgrp_offset + group; return netlink_set_err(net->genl_sock, portid, group, code); } static inline int genl_has_listeners(const struct genl_family *family, struct net *net, unsigned int group) { if (WARN_ON_ONCE(group >= family->n_mcgrps)) return -EINVAL; group = family->mcgrp_offset + group; return netlink_has_listeners(net->genl_sock, group); } #endif /* __NET_GENERIC_NETLINK_H */ |
| 650 158 7122 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 | /* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM notifier #if !defined(_TRACE_NOTIFIERS_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_NOTIFIERS_H #include <linux/tracepoint.h> DECLARE_EVENT_CLASS(notifier_info, TP_PROTO(void *cb), TP_ARGS(cb), TP_STRUCT__entry( __field(void *, cb) ), TP_fast_assign( __entry->cb = cb; ), TP_printk("%ps", __entry->cb) ); /* * notifier_register - called upon notifier callback registration * * @cb: callback pointer * */ DEFINE_EVENT(notifier_info, notifier_register, TP_PROTO(void *cb), TP_ARGS(cb) ); /* * notifier_unregister - called upon notifier callback unregistration * * @cb: callback pointer * */ DEFINE_EVENT(notifier_info, notifier_unregister, TP_PROTO(void *cb), TP_ARGS(cb) ); /* * notifier_run - called upon notifier callback execution * * @cb: callback pointer * */ DEFINE_EVENT(notifier_info, notifier_run, TP_PROTO(void *cb), TP_ARGS(cb) ); #endif /* _TRACE_NOTIFIERS_H */ /* This part must be outside protection */ #include <trace/define_trace.h> |
| 183 183 378 378 79 79 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 | // SPDX-License-Identifier: GPL-2.0 /* Device wakeirq helper functions */ #include <linux/device.h> #include <linux/interrupt.h> #include <linux/irq.h> #include <linux/slab.h> #include <linux/pm_runtime.h> #include <linux/pm_wakeirq.h> #include "power.h" /** * dev_pm_attach_wake_irq - Attach device interrupt as a wake IRQ * @dev: Device entry * @wirq: Wake irq specific data * * Internal function to attach a dedicated wake-up interrupt as a wake IRQ. */ static int dev_pm_attach_wake_irq(struct device *dev, struct wake_irq *wirq) { unsigned long flags; if (!dev || !wirq) return -EINVAL; spin_lock_irqsave(&dev->power.lock, flags); if (dev_WARN_ONCE(dev, dev->power.wakeirq, "wake irq already initialized\n")) { spin_unlock_irqrestore(&dev->power.lock, flags); return -EEXIST; } dev->power.wakeirq = wirq; device_wakeup_attach_irq(dev, wirq); spin_unlock_irqrestore(&dev->power.lock, flags); return 0; } /** * dev_pm_set_wake_irq - Attach device IO interrupt as wake IRQ * @dev: Device entry * @irq: Device IO interrupt * * Attach a device IO interrupt as a wake IRQ. The wake IRQ gets * automatically configured for wake-up from suspend based * on the device specific sysfs wakeup entry. Typically called * during driver probe after calling device_init_wakeup(). */ int dev_pm_set_wake_irq(struct device *dev, int irq) { struct wake_irq *wirq; int err; if (irq < 0) return -EINVAL; wirq = kzalloc(sizeof(*wirq), GFP_KERNEL); if (!wirq) return -ENOMEM; wirq->dev = dev; wirq->irq = irq; err = dev_pm_attach_wake_irq(dev, wirq); if (err) kfree(wirq); return err; } EXPORT_SYMBOL_GPL(dev_pm_set_wake_irq); /** * dev_pm_clear_wake_irq - Detach a device IO interrupt wake IRQ * @dev: Device entry * * Detach a device wake IRQ and free resources. * * Note that it's OK for drivers to call this without calling * dev_pm_set_wake_irq() as all the driver instances may not have * a wake IRQ configured. This avoid adding wake IRQ specific * checks into the drivers. */ void dev_pm_clear_wake_irq(struct device *dev) { struct wake_irq *wirq = dev->power.wakeirq; unsigned long flags; if (!wirq) return; spin_lock_irqsave(&dev->power.lock, flags); device_wakeup_detach_irq(dev); dev->power.wakeirq = NULL; spin_unlock_irqrestore(&dev->power.lock, flags); if (wirq->status & WAKE_IRQ_DEDICATED_ALLOCATED) { free_irq(wirq->irq, wirq); wirq->status &= ~WAKE_IRQ_DEDICATED_MASK; } kfree(wirq->name); kfree(wirq); } EXPORT_SYMBOL_GPL(dev_pm_clear_wake_irq); static void devm_pm_clear_wake_irq(void *dev) { dev_pm_clear_wake_irq(dev); } /** * devm_pm_set_wake_irq - device-managed variant of dev_pm_set_wake_irq * @dev: Device entry * @irq: Device IO interrupt * * * Attach a device IO interrupt as a wake IRQ, same with dev_pm_set_wake_irq, * but the device will be auto clear wake capability on driver detach. */ int devm_pm_set_wake_irq(struct device *dev, int irq) { int ret; ret = dev_pm_set_wake_irq(dev, irq); if (ret) return ret; return devm_add_action_or_reset(dev, devm_pm_clear_wake_irq, dev); } EXPORT_SYMBOL_GPL(devm_pm_set_wake_irq); /** * handle_threaded_wake_irq - Handler for dedicated wake-up interrupts * @irq: Device specific dedicated wake-up interrupt * @_wirq: Wake IRQ data * * Some devices have a separate wake-up interrupt in addition to the * device IO interrupt. The wake-up interrupt signals that a device * should be woken up from it's idle state. This handler uses device * specific pm_runtime functions to wake the device, and then it's * up to the device to do whatever it needs to. Note that as the * device may need to restore context and start up regulators, we * use a threaded IRQ. * * Also note that we are not resending the lost device interrupts. * We assume that the wake-up interrupt just needs to wake-up the * device, and then device's pm_runtime_resume() can deal with the * situation. */ static irqreturn_t handle_threaded_wake_irq(int irq, void *_wirq) { struct wake_irq *wirq = _wirq; int res; /* Maybe abort suspend? */ if (irqd_is_wakeup_set(irq_get_irq_data(irq))) { pm_wakeup_event(wirq->dev, 0); return IRQ_HANDLED; } /* We don't want RPM_ASYNC or RPM_NOWAIT here */ res = pm_runtime_resume(wirq->dev); if (res < 0) dev_warn(wirq->dev, "wake IRQ with no resume: %i\n", res); return IRQ_HANDLED; } static int __dev_pm_set_dedicated_wake_irq(struct device *dev, int irq, unsigned int flag) { struct wake_irq *wirq; int err; if (irq < 0) return -EINVAL; wirq = kzalloc(sizeof(*wirq), GFP_KERNEL); if (!wirq) return -ENOMEM; wirq->name = kasprintf(GFP_KERNEL, "%s:wakeup", dev_name(dev)); if (!wirq->name) { err = -ENOMEM; goto err_free; } wirq->dev = dev; wirq->irq = irq; /* Prevent deferred spurious wakeirqs with disable_irq_nosync() */ irq_set_status_flags(irq, IRQ_DISABLE_UNLAZY); /* * Consumer device may need to power up and restore state * so we use a threaded irq. */ err = request_threaded_irq(irq, NULL, handle_threaded_wake_irq, IRQF_ONESHOT | IRQF_NO_AUTOEN, wirq->name, wirq); if (err) goto err_free_name; err = dev_pm_attach_wake_irq(dev, wirq); if (err) goto err_free_irq; wirq->status = WAKE_IRQ_DEDICATED_ALLOCATED | flag; return err; err_free_irq: free_irq(irq, wirq); err_free_name: kfree(wirq->name); err_free: kfree(wirq); return err; } /** * dev_pm_set_dedicated_wake_irq - Request a dedicated wake-up interrupt * @dev: Device entry * @irq: Device wake-up interrupt * * Unless your hardware has separate wake-up interrupts in addition * to the device IO interrupts, you don't need this. * * Sets up a threaded interrupt handler for a device that has * a dedicated wake-up interrupt in addition to the device IO * interrupt. */ int dev_pm_set_dedicated_wake_irq(struct device *dev, int irq) { return __dev_pm_set_dedicated_wake_irq(dev, irq, 0); } EXPORT_SYMBOL_GPL(dev_pm_set_dedicated_wake_irq); /** * dev_pm_set_dedicated_wake_irq_reverse - Request a dedicated wake-up interrupt * with reverse enable ordering * @dev: Device entry * @irq: Device wake-up interrupt * * Unless your hardware has separate wake-up interrupts in addition * to the device IO interrupts, you don't need this. * * Sets up a threaded interrupt handler for a device that has a dedicated * wake-up interrupt in addition to the device IO interrupt. It sets * the status of WAKE_IRQ_DEDICATED_REVERSE to tell rpm_suspend() * to enable dedicated wake-up interrupt after running the runtime suspend * callback for @dev. */ int dev_pm_set_dedicated_wake_irq_reverse(struct device *dev, int irq) { return __dev_pm_set_dedicated_wake_irq(dev, irq, WAKE_IRQ_DEDICATED_REVERSE); } EXPORT_SYMBOL_GPL(dev_pm_set_dedicated_wake_irq_reverse); /** * dev_pm_enable_wake_irq_check - Checks and enables wake-up interrupt * @dev: Device * @can_change_status: Can change wake-up interrupt status * * Enables wakeirq conditionally. We need to enable wake-up interrupt * lazily on the first rpm_suspend(). This is needed as the consumer device * starts in RPM_SUSPENDED state, and the first pm_runtime_get() would * otherwise try to disable already disabled wakeirq. The wake-up interrupt * starts disabled with IRQ_NOAUTOEN set. * * Should be only called from rpm_suspend() and rpm_resume() path. * Caller must hold &dev->power.lock to change wirq->status */ void dev_pm_enable_wake_irq_check(struct device *dev, bool can_change_status) { struct wake_irq *wirq = dev->power.wakeirq; if (!wirq || !(wirq->status & WAKE_IRQ_DEDICATED_MASK)) return; if (likely(wirq->status & WAKE_IRQ_DEDICATED_MANAGED)) { goto enable; } else if (can_change_status) { wirq->status |= WAKE_IRQ_DEDICATED_MANAGED; goto enable; } return; enable: if (!can_change_status || !(wirq->status & WAKE_IRQ_DEDICATED_REVERSE)) { enable_irq(wirq->irq); wirq->status |= WAKE_IRQ_DEDICATED_ENABLED; } } /** * dev_pm_disable_wake_irq_check - Checks and disables wake-up interrupt * @dev: Device * @cond_disable: if set, also check WAKE_IRQ_DEDICATED_REVERSE * * Disables wake-up interrupt conditionally based on status. * Should be only called from rpm_suspend() and rpm_resume() path. */ void dev_pm_disable_wake_irq_check(struct device *dev, bool cond_disable) { struct wake_irq *wirq = dev->power.wakeirq; if (!wirq || !(wirq->status & WAKE_IRQ_DEDICATED_MASK)) return; if (cond_disable && (wirq->status & WAKE_IRQ_DEDICATED_REVERSE)) return; if (wirq->status & WAKE_IRQ_DEDICATED_MANAGED) { wirq->status &= ~WAKE_IRQ_DEDICATED_ENABLED; disable_irq_nosync(wirq->irq); } } /** * dev_pm_enable_wake_irq_complete - enable wake IRQ not enabled before * @dev: Device using the wake IRQ * * Enable wake IRQ conditionally based on status, mainly used if want to * enable wake IRQ after running ->runtime_suspend() which depends on * WAKE_IRQ_DEDICATED_REVERSE. * * Should be only called from rpm_suspend() path. */ void dev_pm_enable_wake_irq_complete(struct device *dev) { struct wake_irq *wirq = dev->power.wakeirq; if (!wirq || !(wirq->status & WAKE_IRQ_DEDICATED_MASK)) return; if (wirq->status & WAKE_IRQ_DEDICATED_MANAGED && wirq->status & WAKE_IRQ_DEDICATED_REVERSE) { enable_irq(wirq->irq); wirq->status |= WAKE_IRQ_DEDICATED_ENABLED; } } /** * dev_pm_arm_wake_irq - Arm device wake-up * @wirq: Device wake-up interrupt * * Sets up the wake-up event conditionally based on the * device_may_wake(). */ void dev_pm_arm_wake_irq(struct wake_irq *wirq) { if (!wirq) return; if (device_may_wakeup(wirq->dev)) { if (wirq->status & WAKE_IRQ_DEDICATED_ALLOCATED && !(wirq->status & WAKE_IRQ_DEDICATED_ENABLED)) enable_irq(wirq->irq); enable_irq_wake(wirq->irq); } } /** * dev_pm_disarm_wake_irq - Disarm device wake-up * @wirq: Device wake-up interrupt * * Clears up the wake-up event conditionally based on the * device_may_wake(). */ void dev_pm_disarm_wake_irq(struct wake_irq *wirq) { if (!wirq) return; if (device_may_wakeup(wirq->dev)) { disable_irq_wake(wirq->irq); if (wirq->status & WAKE_IRQ_DEDICATED_ALLOCATED && !(wirq->status & WAKE_IRQ_DEDICATED_ENABLED)) disable_irq_nosync(wirq->irq); } } |
| 2 2 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 | /* * Update: The Berkeley copyright was changed, and the change * is retroactive to all "true" BSD software (ie everything * from UCB as opposed to other peoples code that just carried * the same license). The new copyright doesn't clash with the * GPL, so the module-only restriction has been removed.. */ /* Because this code is derived from the 4.3BSD compress source: * * Copyright (c) 1985, 1986 The Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * James A. Woods, derived from original work by Spencer Thomas * and Joseph Orost. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by the University of * California, Berkeley and its contributors. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * This version is for use with contiguous buffers on Linux-derived systems. * * ==FILEVERSION 20000226== * * NOTE TO MAINTAINERS: * If you modify this file at all, please set the number above to the * date of the modification as YYMMDD (year month day). * bsd_comp.c is shipped with a PPP distribution as well as with * the kernel; if everyone increases the FILEVERSION number above, * then scripts can do the right thing when deciding whether to * install a new bsd_comp.c file. Don't change the format of that * line otherwise, so the installation script can recognize it. * * From: bsd_comp.c,v 1.3 1994/12/08 01:59:58 paulus Exp */ #include <linux/module.h> #include <linux/init.h> #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/string.h> #include <linux/ppp_defs.h> #undef PACKETPTR #define PACKETPTR 1 #include <linux/ppp-comp.h> #undef PACKETPTR #include <asm/byteorder.h> /* * PPP "BSD compress" compression * The differences between this compression and the classic BSD LZW * source are obvious from the requirement that the classic code worked * with files while this handles arbitrarily long streams that * are broken into packets. They are: * * When the code size expands, a block of junk is not emitted by * the compressor and not expected by the decompressor. * * New codes are not necessarily assigned every time an old * code is output by the compressor. This is because a packet * end forces a code to be emitted, but does not imply that a * new sequence has been seen. * * The compression ratio is checked at the first end of a packet * after the appropriate gap. Besides simplifying and speeding * things up, this makes it more likely that the transmitter * and receiver will agree when the dictionary is cleared when * compression is not going well. */ /* * Macros to extract protocol version and number of bits * from the third byte of the BSD Compress CCP configuration option. */ #define BSD_VERSION(x) ((x) >> 5) #define BSD_NBITS(x) ((x) & 0x1F) #define BSD_CURRENT_VERSION 1 /* * A dictionary for doing BSD compress. */ struct bsd_dict { union { /* hash value */ unsigned long fcode; struct { #if defined(__LITTLE_ENDIAN) /* Little endian order */ unsigned short prefix; /* preceding code */ unsigned char suffix; /* last character of new code */ unsigned char pad; #elif defined(__BIG_ENDIAN) /* Big endian order */ unsigned char pad; unsigned char suffix; /* last character of new code */ unsigned short prefix; /* preceding code */ #else #error Endianness not defined... #endif } hs; } f; unsigned short codem1; /* output of hash table -1 */ unsigned short cptr; /* map code to hash table entry */ }; struct bsd_db { int totlen; /* length of this structure */ unsigned int hsize; /* size of the hash table */ unsigned char hshift; /* used in hash function */ unsigned char n_bits; /* current bits/code */ unsigned char maxbits; /* maximum bits/code */ unsigned char debug; /* non-zero if debug desired */ unsigned char unit; /* ppp unit number */ unsigned short seqno; /* sequence # of next packet */ unsigned int mru; /* size of receive (decompress) bufr */ unsigned int maxmaxcode; /* largest valid code */ unsigned int max_ent; /* largest code in use */ unsigned int in_count; /* uncompressed bytes, aged */ unsigned int bytes_out; /* compressed bytes, aged */ unsigned int ratio; /* recent compression ratio */ unsigned int checkpoint; /* when to next check the ratio */ unsigned int clear_count; /* times dictionary cleared */ unsigned int incomp_count; /* incompressible packets */ unsigned int incomp_bytes; /* incompressible bytes */ unsigned int uncomp_count; /* uncompressed packets */ unsigned int uncomp_bytes; /* uncompressed bytes */ unsigned int comp_count; /* compressed packets */ unsigned int comp_bytes; /* compressed bytes */ unsigned short *lens; /* array of lengths of codes */ struct bsd_dict *dict; /* dictionary */ }; #define BSD_OVHD 2 /* BSD compress overhead/packet */ #define MIN_BSD_BITS 9 #define BSD_INIT_BITS MIN_BSD_BITS #define MAX_BSD_BITS 15 static void bsd_free (void *state); static void *bsd_alloc(unsigned char *options, int opt_len, int decomp); static void *bsd_comp_alloc (unsigned char *options, int opt_len); static void *bsd_decomp_alloc (unsigned char *options, int opt_len); static int bsd_init (void *db, unsigned char *options, int opt_len, int unit, int debug, int decomp); static int bsd_comp_init (void *state, unsigned char *options, int opt_len, int unit, int opthdr, int debug); static int bsd_decomp_init (void *state, unsigned char *options, int opt_len, int unit, int opthdr, int mru, int debug); static void bsd_reset (void *state); static void bsd_comp_stats (void *state, struct compstat *stats); static int bsd_compress (void *state, unsigned char *rptr, unsigned char *obuf, int isize, int osize); static void bsd_incomp (void *state, unsigned char *ibuf, int icnt); static int bsd_decompress (void *state, unsigned char *ibuf, int isize, unsigned char *obuf, int osize); /* These are in ppp_generic.c */ extern int ppp_register_compressor (struct compressor *cp); extern void ppp_unregister_compressor (struct compressor *cp); /* * the next two codes should not be changed lightly, as they must not * lie within the contiguous general code space. */ #define CLEAR 256 /* table clear output code */ #define FIRST 257 /* first free entry */ #define LAST 255 #define MAXCODE(b) ((1 << (b)) - 1) #define BADCODEM1 MAXCODE(MAX_BSD_BITS) #define BSD_HASH(prefix,suffix,hshift) ((((unsigned long)(suffix))<<(hshift)) \ ^ (unsigned long)(prefix)) #define BSD_KEY(prefix,suffix) ((((unsigned long)(suffix)) << 16) \ + (unsigned long)(prefix)) #define CHECK_GAP 10000 /* Ratio check interval */ #define RATIO_SCALE_LOG 8 #define RATIO_SCALE (1<<RATIO_SCALE_LOG) #define RATIO_MAX (0x7fffffff>>RATIO_SCALE_LOG) /* * clear the dictionary */ static void bsd_clear(struct bsd_db *db) { db->clear_count++; db->max_ent = FIRST-1; db->n_bits = BSD_INIT_BITS; db->bytes_out = 0; db->in_count = 0; db->ratio = 0; db->checkpoint = CHECK_GAP; } /* * If the dictionary is full, then see if it is time to reset it. * * Compute the compression ratio using fixed-point arithmetic * with 8 fractional bits. * * Since we have an infinite stream instead of a single file, * watch only the local compression ratio. * * Since both peers must reset the dictionary at the same time even in * the absence of CLEAR codes (while packets are incompressible), they * must compute the same ratio. */ static int bsd_check (struct bsd_db *db) /* 1=output CLEAR */ { unsigned int new_ratio; if (db->in_count >= db->checkpoint) { /* age the ratio by limiting the size of the counts */ if (db->in_count >= RATIO_MAX || db->bytes_out >= RATIO_MAX) { db->in_count -= (db->in_count >> 2); db->bytes_out -= (db->bytes_out >> 2); } db->checkpoint = db->in_count + CHECK_GAP; if (db->max_ent >= db->maxmaxcode) { /* Reset the dictionary only if the ratio is worse, * or if it looks as if it has been poisoned * by incompressible data. * * This does not overflow, because * db->in_count <= RATIO_MAX. */ new_ratio = db->in_count << RATIO_SCALE_LOG; if (db->bytes_out != 0) { new_ratio /= db->bytes_out; } if (new_ratio < db->ratio || new_ratio < 1 * RATIO_SCALE) { bsd_clear (db); return 1; } db->ratio = new_ratio; } } return 0; } /* * Return statistics. */ static void bsd_comp_stats (void *state, struct compstat *stats) { struct bsd_db *db = (struct bsd_db *) state; stats->unc_bytes = db->uncomp_bytes; stats->unc_packets = db->uncomp_count; stats->comp_bytes = db->comp_bytes; stats->comp_packets = db->comp_count; stats->inc_bytes = db->incomp_bytes; stats->inc_packets = db->incomp_count; stats->in_count = db->in_count; stats->bytes_out = db->bytes_out; } /* * Reset state, as on a CCP ResetReq. */ static void bsd_reset (void *state) { struct bsd_db *db = (struct bsd_db *) state; bsd_clear(db); db->seqno = 0; db->clear_count = 0; } /* * Release the compression structure */ static void bsd_free (void *state) { struct bsd_db *db = state; if (!db) return; /* * Release the dictionary */ vfree(db->dict); db->dict = NULL; /* * Release the string buffer */ vfree(db->lens); db->lens = NULL; /* * Finally release the structure itself. */ kfree(db); } /* * Allocate space for a (de) compressor. */ static void *bsd_alloc (unsigned char *options, int opt_len, int decomp) { int bits; unsigned int hsize, hshift, maxmaxcode; struct bsd_db *db; if (opt_len != 3 || options[0] != CI_BSD_COMPRESS || options[1] != 3 || BSD_VERSION(options[2]) != BSD_CURRENT_VERSION) { return NULL; } bits = BSD_NBITS(options[2]); switch (bits) { case 9: /* needs 82152 for both directions */ case 10: /* needs 84144 */ case 11: /* needs 88240 */ case 12: /* needs 96432 */ hsize = 5003; hshift = 4; break; case 13: /* needs 176784 */ hsize = 9001; hshift = 5; break; case 14: /* needs 353744 */ hsize = 18013; hshift = 6; break; case 15: /* needs 691440 */ hsize = 35023; hshift = 7; break; case 16: /* needs 1366160--far too much, */ /* hsize = 69001; */ /* and 69001 is too big for cptr */ /* hshift = 8; */ /* in struct bsd_db */ /* break; */ default: return NULL; } /* * Allocate the main control structure for this instance. */ maxmaxcode = MAXCODE(bits); db = kzalloc(sizeof (struct bsd_db), GFP_KERNEL); if (!db) { return NULL; } /* * Allocate space for the dictionary. This may be more than one page in * length. */ db->dict = vmalloc(array_size(hsize, sizeof(struct bsd_dict))); if (!db->dict) { bsd_free (db); return NULL; } /* * If this is the compression buffer then there is no length data. */ if (!decomp) { db->lens = NULL; } /* * For decompression, the length information is needed as well. */ else { db->lens = vmalloc(array_size(sizeof(db->lens[0]), (maxmaxcode + 1))); if (!db->lens) { bsd_free (db); return NULL; } } /* * Initialize the data information for the compression code */ db->totlen = sizeof (struct bsd_db) + (sizeof (struct bsd_dict) * hsize); db->hsize = hsize; db->hshift = hshift; db->maxmaxcode = maxmaxcode; db->maxbits = bits; return (void *) db; } static void *bsd_comp_alloc (unsigned char *options, int opt_len) { return bsd_alloc (options, opt_len, 0); } static void *bsd_decomp_alloc (unsigned char *options, int opt_len) { return bsd_alloc (options, opt_len, 1); } /* * Initialize the database. */ static int bsd_init (void *state, unsigned char *options, int opt_len, int unit, int debug, int decomp) { struct bsd_db *db = state; int indx; if ((opt_len != 3) || (options[0] != CI_BSD_COMPRESS) || (options[1] != 3) || (BSD_VERSION(options[2]) != BSD_CURRENT_VERSION) || (BSD_NBITS(options[2]) != db->maxbits) || (decomp && db->lens == NULL)) { return 0; } if (decomp) { indx = LAST; do { db->lens[indx] = 1; } while (indx-- > 0); } indx = db->hsize; while (indx-- != 0) { db->dict[indx].codem1 = BADCODEM1; db->dict[indx].cptr = 0; } db->unit = unit; db->mru = 0; #ifndef DEBUG if (debug) #endif db->debug = 1; bsd_reset(db); return 1; } static int bsd_comp_init (void *state, unsigned char *options, int opt_len, int unit, int opthdr, int debug) { return bsd_init (state, options, opt_len, unit, debug, 0); } static int bsd_decomp_init (void *state, unsigned char *options, int opt_len, int unit, int opthdr, int mru, int debug) { return bsd_init (state, options, opt_len, unit, debug, 1); } /* * Obtain pointers to the various structures in the compression tables */ #define dict_ptrx(p,idx) &(p->dict[idx]) #define lens_ptrx(p,idx) &(p->lens[idx]) #ifdef DEBUG static unsigned short *lens_ptr(struct bsd_db *db, int idx) { if ((unsigned int) idx > (unsigned int) db->maxmaxcode) { printk ("<9>ppp: lens_ptr(%d) > max\n", idx); idx = 0; } return lens_ptrx (db, idx); } static struct bsd_dict *dict_ptr(struct bsd_db *db, int idx) { if ((unsigned int) idx >= (unsigned int) db->hsize) { printk ("<9>ppp: dict_ptr(%d) > max\n", idx); idx = 0; } return dict_ptrx (db, idx); } #else #define lens_ptr(db,idx) lens_ptrx(db,idx) #define dict_ptr(db,idx) dict_ptrx(db,idx) #endif /* * compress a packet * * The result of this function is the size of the compressed * packet. A zero is returned if the packet was not compressed * for some reason, such as the size being larger than uncompressed. * * One change from the BSD compress command is that when the * code size expands, we do not output a bunch of padding. */ static int bsd_compress (void *state, unsigned char *rptr, unsigned char *obuf, int isize, int osize) { struct bsd_db *db; int hshift; unsigned int max_ent; unsigned int n_bits; unsigned int bitno; unsigned long accm; int ent; unsigned long fcode; struct bsd_dict *dictp; unsigned char c; int hval; int disp; int ilen; int mxcode; unsigned char *wptr; int olen; #define PUTBYTE(v) \ { \ ++olen; \ if (wptr) \ { \ *wptr++ = (unsigned char) (v); \ if (olen >= osize) \ { \ wptr = NULL; \ } \ } \ } #define OUTPUT(ent) \ { \ bitno -= n_bits; \ accm |= ((ent) << bitno); \ do \ { \ PUTBYTE(accm >> 24); \ accm <<= 8; \ bitno += 8; \ } \ while (bitno <= 24); \ } /* * If the protocol is not in the range we're interested in, * just return without compressing the packet. If it is, * the protocol becomes the first byte to compress. */ ent = PPP_PROTOCOL(rptr); if (ent < 0x21 || ent > 0xf9) { return 0; } db = (struct bsd_db *) state; hshift = db->hshift; max_ent = db->max_ent; n_bits = db->n_bits; bitno = 32; accm = 0; mxcode = MAXCODE (n_bits); /* Initialize the output pointers */ wptr = obuf; olen = PPP_HDRLEN + BSD_OVHD; if (osize > isize) { osize = isize; } /* This is the PPP header information */ if (wptr) { *wptr++ = PPP_ADDRESS(rptr); *wptr++ = PPP_CONTROL(rptr); *wptr++ = 0; *wptr++ = PPP_COMP; *wptr++ = db->seqno >> 8; *wptr++ = db->seqno; } /* Skip the input header */ rptr += PPP_HDRLEN; isize -= PPP_HDRLEN; ilen = ++isize; /* Low byte of protocol is counted as input */ while (--ilen > 0) { c = *rptr++; fcode = BSD_KEY (ent, c); hval = BSD_HASH (ent, c, hshift); dictp = dict_ptr (db, hval); /* Validate and then check the entry. */ if (dictp->codem1 >= max_ent) { goto nomatch; } if (dictp->f.fcode == fcode) { ent = dictp->codem1 + 1; continue; /* found (prefix,suffix) */ } /* continue probing until a match or invalid entry */ disp = (hval == 0) ? 1 : hval; do { hval += disp; if (hval >= db->hsize) { hval -= db->hsize; } dictp = dict_ptr (db, hval); if (dictp->codem1 >= max_ent) { goto nomatch; } } while (dictp->f.fcode != fcode); ent = dictp->codem1 + 1; /* finally found (prefix,suffix) */ continue; nomatch: OUTPUT(ent); /* output the prefix */ /* code -> hashtable */ if (max_ent < db->maxmaxcode) { struct bsd_dict *dictp2; struct bsd_dict *dictp3; int indx; /* expand code size if needed */ if (max_ent >= mxcode) { db->n_bits = ++n_bits; mxcode = MAXCODE (n_bits); } /* Invalidate old hash table entry using * this code, and then take it over. */ dictp2 = dict_ptr (db, max_ent + 1); indx = dictp2->cptr; dictp3 = dict_ptr (db, indx); if (dictp3->codem1 == max_ent) { dictp3->codem1 = BADCODEM1; } dictp2->cptr = hval; dictp->codem1 = max_ent; dictp->f.fcode = fcode; db->max_ent = ++max_ent; if (db->lens) { unsigned short *len1 = lens_ptr (db, max_ent); unsigned short *len2 = lens_ptr (db, ent); *len1 = *len2 + 1; } } ent = c; } OUTPUT(ent); /* output the last code */ db->bytes_out += olen - PPP_HDRLEN - BSD_OVHD; db->uncomp_bytes += isize; db->in_count += isize; ++db->uncomp_count; ++db->seqno; if (bitno < 32) { ++db->bytes_out; /* must be set before calling bsd_check */ } /* * Generate the clear command if needed */ if (bsd_check(db)) { OUTPUT (CLEAR); } /* * Pad dribble bits of last code with ones. * Do not emit a completely useless byte of ones. */ if (bitno != 32) { PUTBYTE((accm | (0xff << (bitno-8))) >> 24); } /* * Increase code size if we would have without the packet * boundary because the decompressor will do so. */ if (max_ent >= mxcode && max_ent < db->maxmaxcode) { db->n_bits++; } /* If output length is too large then this is an incomplete frame. */ if (wptr == NULL) { ++db->incomp_count; db->incomp_bytes += isize; olen = 0; } else /* Count the number of compressed frames */ { ++db->comp_count; db->comp_bytes += olen; } /* Return the resulting output length */ return olen; #undef OUTPUT #undef PUTBYTE } /* * Update the "BSD Compress" dictionary on the receiver for * incompressible data by pretending to compress the incoming data. */ static void bsd_incomp (void *state, unsigned char *ibuf, int icnt) { (void) bsd_compress (state, ibuf, (char *) 0, icnt, 0); } /* * Decompress "BSD Compress". * * Because of patent problems, we return DECOMP_ERROR for errors * found by inspecting the input data and for system problems, but * DECOMP_FATALERROR for any errors which could possibly be said to * be being detected "after" decompression. For DECOMP_ERROR, * we can issue a CCP reset-request; for DECOMP_FATALERROR, we may be * infringing a patent of Motorola's if we do, so we take CCP down * instead. * * Given that the frame has the correct sequence number and a good FCS, * errors such as invalid codes in the input most likely indicate a * bug, so we return DECOMP_FATALERROR for them in order to turn off * compression, even though they are detected by inspecting the input. */ static int bsd_decompress (void *state, unsigned char *ibuf, int isize, unsigned char *obuf, int osize) { struct bsd_db *db; unsigned int max_ent; unsigned long accm; unsigned int bitno; /* 1st valid bit in accm */ unsigned int n_bits; unsigned int tgtbitno; /* bitno when we have a code */ struct bsd_dict *dictp; int explen; int seq; unsigned int incode; unsigned int oldcode; unsigned int finchar; unsigned char *p; unsigned char *wptr; int adrs; int ctrl; int ilen; int codelen; int extra; db = (struct bsd_db *) state; max_ent = db->max_ent; accm = 0; bitno = 32; /* 1st valid bit in accm */ n_bits = db->n_bits; tgtbitno = 32 - n_bits; /* bitno when we have a code */ /* * Save the address/control from the PPP header * and then get the sequence number. */ adrs = PPP_ADDRESS (ibuf); ctrl = PPP_CONTROL (ibuf); seq = (ibuf[4] << 8) + ibuf[5]; ibuf += (PPP_HDRLEN + 2); ilen = isize - (PPP_HDRLEN + 2); /* * Check the sequence number and give up if it differs from * the value we're expecting. */ if (seq != db->seqno) { if (db->debug) { printk("bsd_decomp%d: bad sequence # %d, expected %d\n", db->unit, seq, db->seqno - 1); } return DECOMP_ERROR; } ++db->seqno; db->bytes_out += ilen; /* * Fill in the ppp header, but not the last byte of the protocol * (that comes from the decompressed data). */ wptr = obuf; *wptr++ = adrs; *wptr++ = ctrl; *wptr++ = 0; oldcode = CLEAR; explen = 3; /* * Keep the checkpoint correctly so that incompressible packets * clear the dictionary at the proper times. */ for (;;) { if (ilen-- <= 0) { db->in_count += (explen - 3); /* don't count the header */ break; } /* * Accumulate bytes until we have a complete code. * Then get the next code, relying on the 32-bit, * unsigned accm to mask the result. */ bitno -= 8; accm |= *ibuf++ << bitno; if (tgtbitno < bitno) { continue; } incode = accm >> tgtbitno; accm <<= n_bits; bitno += n_bits; /* * The dictionary must only be cleared at the end of a packet. */ if (incode == CLEAR) { if (ilen > 0) { if (db->debug) { printk("bsd_decomp%d: bad CLEAR\n", db->unit); } return DECOMP_FATALERROR; /* probably a bug */ } bsd_clear(db); break; } if ((incode > max_ent + 2) || (incode > db->maxmaxcode) || (incode > max_ent && oldcode == CLEAR)) { if (db->debug) { printk("bsd_decomp%d: bad code 0x%x oldcode=0x%x ", db->unit, incode, oldcode); printk("max_ent=0x%x explen=%d seqno=%d\n", max_ent, explen, db->seqno); } return DECOMP_FATALERROR; /* probably a bug */ } /* Special case for KwKwK string. */ if (incode > max_ent) { finchar = oldcode; extra = 1; } else { finchar = incode; extra = 0; } codelen = *(lens_ptr (db, finchar)); explen += codelen + extra; if (explen > osize) { if (db->debug) { printk("bsd_decomp%d: ran out of mru\n", db->unit); #ifdef DEBUG printk(" len=%d, finchar=0x%x, codelen=%d, explen=%d\n", ilen, finchar, codelen, explen); #endif } return DECOMP_FATALERROR; } /* * Decode this code and install it in the decompressed buffer. */ wptr += codelen; p = wptr; while (finchar > LAST) { struct bsd_dict *dictp2 = dict_ptr (db, finchar); dictp = dict_ptr (db, dictp2->cptr); #ifdef DEBUG if (--codelen <= 0 || dictp->codem1 != finchar-1) { if (codelen <= 0) { printk("bsd_decomp%d: fell off end of chain ", db->unit); printk("0x%x at 0x%x by 0x%x, max_ent=0x%x\n", incode, finchar, dictp2->cptr, max_ent); } else { if (dictp->codem1 != finchar-1) { printk("bsd_decomp%d: bad code chain 0x%x " "finchar=0x%x ", db->unit, incode, finchar); printk("oldcode=0x%x cptr=0x%x codem1=0x%x\n", oldcode, dictp2->cptr, dictp->codem1); } } return DECOMP_FATALERROR; } #endif *--p = dictp->f.hs.suffix; finchar = dictp->f.hs.prefix; } *--p = finchar; #ifdef DEBUG if (--codelen != 0) { printk("bsd_decomp%d: short by %d after code 0x%x, max_ent=0x%x\n", db->unit, codelen, incode, max_ent); } #endif if (extra) /* the KwKwK case again */ { *wptr++ = finchar; } /* * If not first code in a packet, and * if not out of code space, then allocate a new code. * * Keep the hash table correct so it can be used * with uncompressed packets. */ if (oldcode != CLEAR && max_ent < db->maxmaxcode) { struct bsd_dict *dictp2, *dictp3; unsigned short *lens1, *lens2; unsigned long fcode; int hval, disp, indx; fcode = BSD_KEY(oldcode,finchar); hval = BSD_HASH(oldcode,finchar,db->hshift); dictp = dict_ptr (db, hval); /* look for a free hash table entry */ if (dictp->codem1 < max_ent) { disp = (hval == 0) ? 1 : hval; do { hval += disp; if (hval >= db->hsize) { hval -= db->hsize; } dictp = dict_ptr (db, hval); } while (dictp->codem1 < max_ent); } /* * Invalidate previous hash table entry * assigned this code, and then take it over */ dictp2 = dict_ptr (db, max_ent + 1); indx = dictp2->cptr; dictp3 = dict_ptr (db, indx); if (dictp3->codem1 == max_ent) { dictp3->codem1 = BADCODEM1; } dictp2->cptr = hval; dictp->codem1 = max_ent; dictp->f.fcode = fcode; db->max_ent = ++max_ent; /* Update the length of this string. */ lens1 = lens_ptr (db, max_ent); lens2 = lens_ptr (db, oldcode); *lens1 = *lens2 + 1; /* Expand code size if needed. */ if (max_ent >= MAXCODE(n_bits) && max_ent < db->maxmaxcode) { db->n_bits = ++n_bits; tgtbitno = 32-n_bits; } } oldcode = incode; } ++db->comp_count; ++db->uncomp_count; db->comp_bytes += isize - BSD_OVHD - PPP_HDRLEN; db->uncomp_bytes += explen; if (bsd_check(db)) { if (db->debug) { printk("bsd_decomp%d: peer should have cleared dictionary on %d\n", db->unit, db->seqno - 1); } } return explen; } /************************************************************* * Table of addresses for the BSD compression module *************************************************************/ static struct compressor ppp_bsd_compress = { .compress_proto = CI_BSD_COMPRESS, .comp_alloc = bsd_comp_alloc, .comp_free = bsd_free, .comp_init = bsd_comp_init, .comp_reset = bsd_reset, .compress = bsd_compress, .comp_stat = bsd_comp_stats, .decomp_alloc = bsd_decomp_alloc, .decomp_free = bsd_free, .decomp_init = bsd_decomp_init, .decomp_reset = bsd_reset, .decompress = bsd_decompress, .incomp = bsd_incomp, .decomp_stat = bsd_comp_stats, .owner = THIS_MODULE }; /************************************************************* * Module support routines *************************************************************/ static int __init bsdcomp_init(void) { int answer = ppp_register_compressor(&ppp_bsd_compress); if (answer == 0) printk(KERN_INFO "PPP BSD Compression module registered\n"); return answer; } static void __exit bsdcomp_cleanup(void) { ppp_unregister_compressor(&ppp_bsd_compress); } module_init(bsdcomp_init); module_exit(bsdcomp_cleanup); MODULE_DESCRIPTION("PPP BSD-Compress compression module"); MODULE_LICENSE("Dual BSD/GPL"); MODULE_ALIAS("ppp-compress-" __stringify(CI_BSD_COMPRESS)); |
| 21 21 17 4 21 21 21 19 19 19 21 21 21 21 21 21 21 19 19 19 19 19 19 19 19 19 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 | // SPDX-License-Identifier: GPL-2.0-only /* Object lifetime handling and tracing. * * Copyright (C) 2022 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) */ #include <linux/slab.h> #include <linux/mempool.h> #include <linux/delay.h> #include "internal.h" static void netfs_free_request(struct work_struct *work); /* * Allocate an I/O request and initialise it. */ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping, struct file *file, loff_t start, size_t len, enum netfs_io_origin origin) { static atomic_t debug_ids; struct inode *inode = file ? file_inode(file) : mapping->host; struct netfs_inode *ctx = netfs_inode(inode); struct netfs_io_request *rreq; mempool_t *mempool = ctx->ops->request_pool ?: &netfs_request_pool; struct kmem_cache *cache = mempool->pool_data; int ret; for (;;) { rreq = mempool_alloc(mempool, GFP_KERNEL); if (rreq) break; msleep(10); } memset(rreq, 0, kmem_cache_size(cache)); INIT_WORK(&rreq->cleanup_work, netfs_free_request); rreq->start = start; rreq->len = len; rreq->origin = origin; rreq->netfs_ops = ctx->ops; rreq->mapping = mapping; rreq->inode = inode; rreq->i_size = i_size_read(inode); rreq->debug_id = atomic_inc_return(&debug_ids); rreq->wsize = INT_MAX; rreq->io_streams[0].sreq_max_len = ULONG_MAX; rreq->io_streams[0].sreq_max_segs = 0; spin_lock_init(&rreq->lock); INIT_LIST_HEAD(&rreq->io_streams[0].subrequests); INIT_LIST_HEAD(&rreq->io_streams[1].subrequests); init_waitqueue_head(&rreq->waitq); refcount_set(&rreq->ref, 2); if (origin == NETFS_READAHEAD || origin == NETFS_READPAGE || origin == NETFS_READ_GAPS || origin == NETFS_READ_SINGLE || origin == NETFS_READ_FOR_WRITE || origin == NETFS_UNBUFFERED_READ || origin == NETFS_DIO_READ) { INIT_WORK(&rreq->work, netfs_read_collection_worker); rreq->io_streams[0].avail = true; } else { INIT_WORK(&rreq->work, netfs_write_collection_worker); } __set_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags); if (rreq->netfs_ops->init_request) { ret = rreq->netfs_ops->init_request(rreq, file); if (ret < 0) { mempool_free(rreq, rreq->netfs_ops->request_pool ?: &netfs_request_pool); return ERR_PTR(ret); } } atomic_inc(&ctx->io_count); trace_netfs_rreq_ref(rreq->debug_id, refcount_read(&rreq->ref), netfs_rreq_trace_new); netfs_proc_add_rreq(rreq); netfs_stat(&netfs_n_rh_rreq); return rreq; } void netfs_get_request(struct netfs_io_request *rreq, enum netfs_rreq_ref_trace what) { int r; __refcount_inc(&rreq->ref, &r); trace_netfs_rreq_ref(rreq->debug_id, r + 1, what); } void netfs_clear_subrequests(struct netfs_io_request *rreq) { struct netfs_io_subrequest *subreq; struct netfs_io_stream *stream; int s; for (s = 0; s < ARRAY_SIZE(rreq->io_streams); s++) { stream = &rreq->io_streams[s]; while (!list_empty(&stream->subrequests)) { subreq = list_first_entry(&stream->subrequests, struct netfs_io_subrequest, rreq_link); list_del(&subreq->rreq_link); netfs_put_subrequest(subreq, netfs_sreq_trace_put_clear); } } } static void netfs_free_request_rcu(struct rcu_head *rcu) { struct netfs_io_request *rreq = container_of(rcu, struct netfs_io_request, rcu); mempool_free(rreq, rreq->netfs_ops->request_pool ?: &netfs_request_pool); netfs_stat_d(&netfs_n_rh_rreq); } static void netfs_free_request(struct work_struct *work) { struct netfs_io_request *rreq = container_of(work, struct netfs_io_request, cleanup_work); struct netfs_inode *ictx = netfs_inode(rreq->inode); unsigned int i; trace_netfs_rreq(rreq, netfs_rreq_trace_free); /* Cancel/flush the result collection worker. That does not carry a * ref of its own, so we must wait for it somewhere. */ cancel_work_sync(&rreq->work); netfs_proc_del_rreq(rreq); netfs_clear_subrequests(rreq); if (rreq->netfs_ops->free_request) rreq->netfs_ops->free_request(rreq); if (rreq->cache_resources.ops) rreq->cache_resources.ops->end_operation(&rreq->cache_resources); if (rreq->direct_bv) { for (i = 0; i < rreq->direct_bv_count; i++) { if (rreq->direct_bv[i].bv_page) { if (rreq->direct_bv_unpin) unpin_user_page(rreq->direct_bv[i].bv_page); } } kvfree(rreq->direct_bv); } rolling_buffer_clear(&rreq->buffer); if (atomic_dec_and_test(&ictx->io_count)) wake_up_var(&ictx->io_count); call_rcu(&rreq->rcu, netfs_free_request_rcu); } void netfs_put_request(struct netfs_io_request *rreq, enum netfs_rreq_ref_trace what) { unsigned int debug_id; bool dead; int r; if (rreq) { debug_id = rreq->debug_id; dead = __refcount_dec_and_test(&rreq->ref, &r); trace_netfs_rreq_ref(debug_id, r - 1, what); if (dead) WARN_ON(!queue_work(system_unbound_wq, &rreq->cleanup_work)); } } /* * Allocate and partially initialise an I/O request structure. */ struct netfs_io_subrequest *netfs_alloc_subrequest(struct netfs_io_request *rreq) { struct netfs_io_subrequest *subreq; mempool_t *mempool = rreq->netfs_ops->subrequest_pool ?: &netfs_subrequest_pool; struct kmem_cache *cache = mempool->pool_data; for (;;) { subreq = mempool_alloc(rreq->netfs_ops->subrequest_pool ?: &netfs_subrequest_pool, GFP_KERNEL); if (subreq) break; msleep(10); } memset(subreq, 0, kmem_cache_size(cache)); INIT_WORK(&subreq->work, NULL); INIT_LIST_HEAD(&subreq->rreq_link); refcount_set(&subreq->ref, 2); subreq->rreq = rreq; subreq->debug_index = atomic_inc_return(&rreq->subreq_counter); netfs_get_request(rreq, netfs_rreq_trace_get_subreq); netfs_stat(&netfs_n_rh_sreq); return subreq; } void netfs_get_subrequest(struct netfs_io_subrequest *subreq, enum netfs_sreq_ref_trace what) { int r; __refcount_inc(&subreq->ref, &r); trace_netfs_sreq_ref(subreq->rreq->debug_id, subreq->debug_index, r + 1, what); } static void netfs_free_subrequest(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq = subreq->rreq; trace_netfs_sreq(subreq, netfs_sreq_trace_free); if (rreq->netfs_ops->free_subrequest) rreq->netfs_ops->free_subrequest(subreq); mempool_free(subreq, rreq->netfs_ops->subrequest_pool ?: &netfs_subrequest_pool); netfs_stat_d(&netfs_n_rh_sreq); netfs_put_request(rreq, netfs_rreq_trace_put_subreq); } void netfs_put_subrequest(struct netfs_io_subrequest *subreq, enum netfs_sreq_ref_trace what) { unsigned int debug_index = subreq->debug_index; unsigned int debug_id = subreq->rreq->debug_id; bool dead; int r; dead = __refcount_dec_and_test(&subreq->ref, &r); trace_netfs_sreq_ref(debug_id, debug_index, r - 1, what); if (dead) netfs_free_subrequest(subreq); } |
| 2 2 2 28 28 2 1 28 245 121 121 3 20 1 2 17 2 16 13 4 2 2 22 1 14 2 5 20 2 17 2 1 1 15 89 104 104 12 6 4 1 1 2 77 15 76 69 22 80 80 167 167 1 12 7 2 2 12 10 4 8 10 248 113 134 108 24 3 1 2 4 1 1 20 5 76 3 140 33 107 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 | // SPDX-License-Identifier: GPL-2.0-only /* * Changes: * Arnaldo Carvalho de Melo <acme@conectiva.com.br> 08/23/2000 * - get rid of some verify_areas and use __copy*user and __get/put_user * for the ones that remain */ #include <linux/module.h> #include <linux/blkdev.h> #include <linux/interrupt.h> #include <linux/errno.h> #include <linux/kernel.h> #include <linux/sched.h> #include <linux/mm.h> #include <linux/string.h> #include <linux/uaccess.h> #include <linux/cdrom.h> #include <scsi/scsi.h> #include <scsi/scsi_cmnd.h> #include <scsi/scsi_device.h> #include <scsi/scsi_eh.h> #include <scsi/scsi_host.h> #include <scsi/scsi_ioctl.h> #include <scsi/sg.h> #include <scsi/scsi_dbg.h> #include "scsi_logging.h" #define NORMAL_RETRIES 5 #define IOCTL_NORMAL_TIMEOUT (10 * HZ) #define MAX_BUF PAGE_SIZE /** * ioctl_probe -- return host identification * @host: host to identify * @buffer: userspace buffer for identification * * Return: * * if successful, %1 and an identifying string at @buffer, if @buffer * is non-NULL, filling to the length stored at * (int *) @buffer. * * <0 error code on failure. */ static int ioctl_probe(struct Scsi_Host *host, void __user *buffer) { unsigned int len, slen; const char *string; if (buffer) { if (get_user(len, (unsigned int __user *) buffer)) return -EFAULT; if (host->hostt->info) string = host->hostt->info(host); else string = host->hostt->name; if (string) { slen = strlen(string); if (len > slen) len = slen + 1; if (copy_to_user(buffer, string, len)) return -EFAULT; } } return 1; } static int ioctl_internal_command(struct scsi_device *sdev, char *cmd, int timeout, int retries) { int result; struct scsi_sense_hdr sshdr; const struct scsi_exec_args exec_args = { .sshdr = &sshdr, }; SCSI_LOG_IOCTL(1, sdev_printk(KERN_INFO, sdev, "Trying ioctl with scsi command %d\n", *cmd)); result = scsi_execute_cmd(sdev, cmd, REQ_OP_DRV_IN, NULL, 0, timeout, retries, &exec_args); SCSI_LOG_IOCTL(2, sdev_printk(KERN_INFO, sdev, "Ioctl returned 0x%x\n", result)); if (result < 0) goto out; if (scsi_sense_valid(&sshdr)) { switch (sshdr.sense_key) { case ILLEGAL_REQUEST: if (cmd[0] == ALLOW_MEDIUM_REMOVAL) sdev->lockable = 0; else sdev_printk(KERN_INFO, sdev, "ioctl_internal_command: " "ILLEGAL REQUEST " "asc=0x%x ascq=0x%x\n", sshdr.asc, sshdr.ascq); break; case NOT_READY: /* This happens if there is no disc in drive */ if (sdev->removable) break; fallthrough; case UNIT_ATTENTION: if (sdev->removable) { sdev->changed = 1; result = 0; /* This is no longer considered an error */ break; } fallthrough; /* for non-removable media */ default: sdev_printk(KERN_INFO, sdev, "ioctl_internal_command return code = %x\n", result); scsi_print_sense_hdr(sdev, NULL, &sshdr); break; } } out: SCSI_LOG_IOCTL(2, sdev_printk(KERN_INFO, sdev, "IOCTL Releasing command\n")); return result; } /** * scsi_set_medium_removal() - send command to allow or prevent medium removal * @sdev: target scsi device * @state: removal state to set (prevent or allow) * * Returns: * * %0 if @sdev is not removable or not lockable or successful. * * non-%0 is a SCSI result code if > 0 or kernel error code if < 0. * * Sets @sdev->locked to the new state on success. */ int scsi_set_medium_removal(struct scsi_device *sdev, char state) { char scsi_cmd[MAX_COMMAND_SIZE]; int ret; if (!sdev->removable || !sdev->lockable) return 0; scsi_cmd[0] = ALLOW_MEDIUM_REMOVAL; scsi_cmd[1] = 0; scsi_cmd[2] = 0; scsi_cmd[3] = 0; scsi_cmd[4] = state; scsi_cmd[5] = 0; ret = ioctl_internal_command(sdev, scsi_cmd, IOCTL_NORMAL_TIMEOUT, NORMAL_RETRIES); if (ret == 0) sdev->locked = (state == SCSI_REMOVAL_PREVENT); return ret; } EXPORT_SYMBOL(scsi_set_medium_removal); /* * The scsi_ioctl_get_pci() function places into arg the value * pci_dev::slot_name (8 characters) for the PCI device (if any). * Returns: 0 on success * -ENXIO if there isn't a PCI device pointer * (could be because the SCSI driver hasn't been * updated yet, or because it isn't a SCSI * device) * any copy_to_user() error on failure there */ static int scsi_ioctl_get_pci(struct scsi_device *sdev, void __user *arg) { struct device *dev = scsi_get_device(sdev->host); const char *name; if (!dev) return -ENXIO; name = dev_name(dev); /* compatibility with old ioctl which only returned * 20 characters */ return copy_to_user(arg, name, min(strlen(name), (size_t)20)) ? -EFAULT: 0; } static int sg_get_version(int __user *p) { static const int sg_version_num = 30527; return put_user(sg_version_num, p); } static int sg_set_timeout(struct scsi_device *sdev, int __user *p) { int timeout, err = get_user(timeout, p); if (!err) sdev->sg_timeout = clock_t_to_jiffies(timeout); return err; } static int sg_get_reserved_size(struct scsi_device *sdev, int __user *p) { int val = min(sdev->sg_reserved_size, queue_max_bytes(sdev->request_queue)); return put_user(val, p); } static int sg_set_reserved_size(struct scsi_device *sdev, int __user *p) { int size, err = get_user(size, p); if (err) return err; if (size < 0) return -EINVAL; sdev->sg_reserved_size = min_t(unsigned int, size, queue_max_bytes(sdev->request_queue)); return 0; } /* * will always return that we are ATAPI even for a real SCSI drive, I'm not * so sure this is worth doing anything about (why would you care??) */ static int sg_emulated_host(struct request_queue *q, int __user *p) { return put_user(1, p); } static int scsi_get_idlun(struct scsi_device *sdev, void __user *argp) { struct scsi_idlun v = { .dev_id = (sdev->id & 0xff) + ((sdev->lun & 0xff) << 8) + ((sdev->channel & 0xff) << 16) + ((sdev->host->host_no & 0xff) << 24), .host_unique_id = sdev->host->unique_id }; if (copy_to_user(argp, &v, sizeof(struct scsi_idlun))) return -EFAULT; return 0; } static int scsi_send_start_stop(struct scsi_device *sdev, int data) { u8 cdb[MAX_COMMAND_SIZE] = { }; cdb[0] = START_STOP; cdb[4] = data; return ioctl_internal_command(sdev, cdb, START_STOP_TIMEOUT, NORMAL_RETRIES); } /** * scsi_cmd_allowed() - Check if the given command is allowed. * @cmd: SCSI command to check * @open_for_write: is the file / block device opened for writing? * * Only a subset of commands are allowed for unprivileged users. Commands used * to format the media, update the firmware, etc. are not permitted. * * Return: %true if the cmd is allowed, otherwise @false. */ bool scsi_cmd_allowed(unsigned char *cmd, bool open_for_write) { /* root can do any command. */ if (capable(CAP_SYS_RAWIO)) return true; /* Anybody who can open the device can do a read-safe command */ switch (cmd[0]) { /* Basic read-only commands */ case TEST_UNIT_READY: case REQUEST_SENSE: case READ_6: case READ_10: case READ_12: case READ_16: case READ_BUFFER: case READ_DEFECT_DATA: case READ_CAPACITY: /* also GPCMD_READ_CDVD_CAPACITY */ case READ_LONG: case INQUIRY: case MODE_SENSE: case MODE_SENSE_10: case LOG_SENSE: case START_STOP: case GPCMD_VERIFY_10: case VERIFY_16: case REPORT_LUNS: case SERVICE_ACTION_IN_16: case RECEIVE_DIAGNOSTIC: case MAINTENANCE_IN: /* also GPCMD_SEND_KEY, which is a write command */ case GPCMD_READ_BUFFER_CAPACITY: /* Audio CD commands */ case GPCMD_PLAY_CD: case GPCMD_PLAY_AUDIO_10: case GPCMD_PLAY_AUDIO_MSF: case GPCMD_PLAY_AUDIO_TI: case GPCMD_PAUSE_RESUME: /* CD/DVD data reading */ case GPCMD_READ_CD: case GPCMD_READ_CD_MSF: case GPCMD_READ_DISC_INFO: case GPCMD_READ_DVD_STRUCTURE: case GPCMD_READ_HEADER: case GPCMD_READ_TRACK_RZONE_INFO: case GPCMD_READ_SUBCHANNEL: case GPCMD_READ_TOC_PMA_ATIP: case GPCMD_REPORT_KEY: case GPCMD_SCAN: case GPCMD_GET_CONFIGURATION: case GPCMD_READ_FORMAT_CAPACITIES: case GPCMD_GET_EVENT_STATUS_NOTIFICATION: case GPCMD_GET_PERFORMANCE: case GPCMD_SEEK: case GPCMD_STOP_PLAY_SCAN: /* ZBC */ case ZBC_IN: return true; /* Basic writing commands */ case WRITE_6: case WRITE_10: case WRITE_VERIFY: case WRITE_12: case WRITE_VERIFY_12: case WRITE_16: case WRITE_LONG: case WRITE_LONG_2: case WRITE_SAME: case WRITE_SAME_16: case WRITE_SAME_32: case ERASE: case GPCMD_MODE_SELECT_10: case MODE_SELECT: case LOG_SELECT: case GPCMD_BLANK: case GPCMD_CLOSE_TRACK: case GPCMD_FLUSH_CACHE: case GPCMD_FORMAT_UNIT: case GPCMD_REPAIR_RZONE_TRACK: case GPCMD_RESERVE_RZONE_TRACK: case GPCMD_SEND_DVD_STRUCTURE: case GPCMD_SEND_EVENT: case GPCMD_SEND_OPC: case GPCMD_SEND_CUE_SHEET: case GPCMD_SET_SPEED: case GPCMD_PREVENT_ALLOW_MEDIUM_REMOVAL: case GPCMD_LOAD_UNLOAD: case GPCMD_SET_STREAMING: case GPCMD_SET_READ_AHEAD: /* ZBC */ case ZBC_OUT: return open_for_write; default: return false; } } EXPORT_SYMBOL(scsi_cmd_allowed); static int scsi_fill_sghdr_rq(struct scsi_device *sdev, struct request *rq, struct sg_io_hdr *hdr, bool open_for_write) { struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq); if (hdr->cmd_len < 6) return -EMSGSIZE; if (copy_from_user(scmd->cmnd, hdr->cmdp, hdr->cmd_len)) return -EFAULT; if (!scsi_cmd_allowed(scmd->cmnd, open_for_write)) return -EPERM; scmd->cmd_len = hdr->cmd_len; rq->timeout = msecs_to_jiffies(hdr->timeout); if (!rq->timeout) rq->timeout = sdev->sg_timeout; if (!rq->timeout) rq->timeout = BLK_DEFAULT_SG_TIMEOUT; if (rq->timeout < BLK_MIN_SG_TIMEOUT) rq->timeout = BLK_MIN_SG_TIMEOUT; return 0; } static int scsi_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr, struct bio *bio) { struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq); int r, ret = 0; /* * fill in all the output members */ hdr->status = scmd->result & 0xff; hdr->masked_status = sg_status_byte(scmd->result); hdr->msg_status = COMMAND_COMPLETE; hdr->host_status = host_byte(scmd->result); hdr->driver_status = 0; if (scsi_status_is_check_condition(hdr->status)) hdr->driver_status = DRIVER_SENSE; hdr->info = 0; if (hdr->masked_status || hdr->host_status || hdr->driver_status) hdr->info |= SG_INFO_CHECK; hdr->resid = scmd->resid_len; hdr->sb_len_wr = 0; if (scmd->sense_len && hdr->sbp) { int len = min((unsigned int) hdr->mx_sb_len, scmd->sense_len); if (!copy_to_user(hdr->sbp, scmd->sense_buffer, len)) hdr->sb_len_wr = len; else ret = -EFAULT; } r = blk_rq_unmap_user(bio); if (!ret) ret = r; return ret; } static int sg_io(struct scsi_device *sdev, struct sg_io_hdr *hdr, bool open_for_write) { unsigned long start_time; ssize_t ret = 0; int writing = 0; int at_head = 0; struct request *rq; struct scsi_cmnd *scmd; struct bio *bio; if (hdr->interface_id != 'S') return -EINVAL; if (hdr->dxfer_len > (queue_max_hw_sectors(sdev->request_queue) << 9)) return -EIO; if (hdr->dxfer_len) switch (hdr->dxfer_direction) { default: return -EINVAL; case SG_DXFER_TO_DEV: writing = 1; break; case SG_DXFER_TO_FROM_DEV: case SG_DXFER_FROM_DEV: break; } if (hdr->flags & SG_FLAG_Q_AT_HEAD) at_head = 1; rq = scsi_alloc_request(sdev->request_queue, writing ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0); if (IS_ERR(rq)) return PTR_ERR(rq); scmd = blk_mq_rq_to_pdu(rq); if (hdr->cmd_len > sizeof(scmd->cmnd)) { ret = -EINVAL; goto out_put_request; } ret = scsi_fill_sghdr_rq(sdev, rq, hdr, open_for_write); if (ret < 0) goto out_put_request; ret = blk_rq_map_user_io(rq, NULL, hdr->dxferp, hdr->dxfer_len, GFP_KERNEL, hdr->iovec_count && hdr->dxfer_len, hdr->iovec_count, 0, rq_data_dir(rq)); if (ret) goto out_put_request; bio = rq->bio; scmd->allowed = 0; start_time = jiffies; blk_execute_rq(rq, at_head); hdr->duration = jiffies_to_msecs(jiffies - start_time); ret = scsi_complete_sghdr_rq(rq, hdr, bio); out_put_request: blk_mq_free_request(rq); return ret; } /** * sg_scsi_ioctl -- handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl * @q: request queue to send scsi commands down * @open_for_write: is the file / block device opened for writing? * @sic: userspace structure describing the command to perform * * Send down the scsi command described by @sic to the device below * the request queue @q. * * Notes: * - This interface is deprecated - users should use the SG_IO * interface instead, as this is a more flexible approach to * performing SCSI commands on a device. * - The SCSI command length is determined by examining the 1st byte * of the given command. There is no way to override this. * - Data transfers are limited to PAGE_SIZE * - The length (x + y) must be at least OMAX_SB_LEN bytes long to * accommodate the sense buffer when an error occurs. * The sense buffer is truncated to OMAX_SB_LEN (16) bytes so that * old code will not be surprised. * - If a Unix error occurs (e.g. ENOMEM) then the user will receive * a negative return and the Unix error code in 'errno'. * If the SCSI command succeeds then 0 is returned. * Positive numbers returned are the compacted SCSI error codes (4 * bytes in one int) where the lowest byte is the SCSI status. */ static int sg_scsi_ioctl(struct request_queue *q, bool open_for_write, struct scsi_ioctl_command __user *sic) { struct request *rq; int err; unsigned int in_len, out_len, bytes, opcode, cmdlen; struct scsi_cmnd *scmd; char *buffer = NULL; if (!sic) return -EINVAL; /* * get in an out lengths, verify they don't exceed a page worth of data */ if (get_user(in_len, &sic->inlen)) return -EFAULT; if (get_user(out_len, &sic->outlen)) return -EFAULT; if (in_len > PAGE_SIZE || out_len > PAGE_SIZE) return -EINVAL; if (get_user(opcode, &sic->data[0])) return -EFAULT; bytes = max(in_len, out_len); if (bytes) { buffer = kzalloc(bytes, GFP_NOIO | GFP_USER | __GFP_NOWARN); if (!buffer) return -ENOMEM; } rq = scsi_alloc_request(q, in_len ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0); if (IS_ERR(rq)) { err = PTR_ERR(rq); goto error_free_buffer; } scmd = blk_mq_rq_to_pdu(rq); cmdlen = COMMAND_SIZE(opcode); /* * get command and data to send to device, if any */ err = -EFAULT; scmd->cmd_len = cmdlen; if (copy_from_user(scmd->cmnd, sic->data, cmdlen)) goto error; if (in_len && copy_from_user(buffer, sic->data + cmdlen, in_len)) goto error; err = -EPERM; if (!scsi_cmd_allowed(scmd->cmnd, open_for_write)) goto error; /* default. possible overridden later */ scmd->allowed = 5; switch (opcode) { case SEND_DIAGNOSTIC: case FORMAT_UNIT: rq->timeout = FORMAT_UNIT_TIMEOUT; scmd->allowed = 1; break; case START_STOP: rq->timeout = START_STOP_TIMEOUT; break; case MOVE_MEDIUM: rq->timeout = MOVE_MEDIUM_TIMEOUT; break; case READ_ELEMENT_STATUS: rq->timeout = READ_ELEMENT_STATUS_TIMEOUT; break; case READ_DEFECT_DATA: rq->timeout = READ_DEFECT_DATA_TIMEOUT; scmd->allowed = 1; break; default: rq->timeout = BLK_DEFAULT_SG_TIMEOUT; break; } if (bytes) { err = blk_rq_map_kern(rq, buffer, bytes, GFP_NOIO); if (err) goto error; } blk_execute_rq(rq, false); err = scmd->result & 0xff; /* only 8 bit SCSI status */ if (err) { if (scmd->sense_len && scmd->sense_buffer) { /* limit sense len for backward compatibility */ if (copy_to_user(sic->data, scmd->sense_buffer, min(scmd->sense_len, 16U))) err = -EFAULT; } } else { if (copy_to_user(sic->data, buffer, out_len)) err = -EFAULT; } error: blk_mq_free_request(rq); error_free_buffer: kfree(buffer); return err; } int put_sg_io_hdr(const struct sg_io_hdr *hdr, void __user *argp) { #ifdef CONFIG_COMPAT if (in_compat_syscall()) { struct compat_sg_io_hdr hdr32 = { .interface_id = hdr->interface_id, .dxfer_direction = hdr->dxfer_direction, .cmd_len = hdr->cmd_len, .mx_sb_len = hdr->mx_sb_len, .iovec_count = hdr->iovec_count, .dxfer_len = hdr->dxfer_len, .dxferp = (uintptr_t)hdr->dxferp, .cmdp = (uintptr_t)hdr->cmdp, .sbp = (uintptr_t)hdr->sbp, .timeout = hdr->timeout, .flags = hdr->flags, .pack_id = hdr->pack_id, .usr_ptr = (uintptr_t)hdr->usr_ptr, .status = hdr->status, .masked_status = hdr->masked_status, .msg_status = hdr->msg_status, .sb_len_wr = hdr->sb_len_wr, .host_status = hdr->host_status, .driver_status = hdr->driver_status, .resid = hdr->resid, .duration = hdr->duration, .info = hdr->info, }; if (copy_to_user(argp, &hdr32, sizeof(hdr32))) return -EFAULT; return 0; } #endif if (copy_to_user(argp, hdr, sizeof(*hdr))) return -EFAULT; return 0; } EXPORT_SYMBOL(put_sg_io_hdr); int get_sg_io_hdr(struct sg_io_hdr *hdr, const void __user *argp) { #ifdef CONFIG_COMPAT struct compat_sg_io_hdr hdr32; if (in_compat_syscall()) { if (copy_from_user(&hdr32, argp, sizeof(hdr32))) return -EFAULT; *hdr = (struct sg_io_hdr) { .interface_id = hdr32.interface_id, .dxfer_direction = hdr32.dxfer_direction, .cmd_len = hdr32.cmd_len, .mx_sb_len = hdr32.mx_sb_len, .iovec_count = hdr32.iovec_count, .dxfer_len = hdr32.dxfer_len, .dxferp = compat_ptr(hdr32.dxferp), .cmdp = compat_ptr(hdr32.cmdp), .sbp = compat_ptr(hdr32.sbp), .timeout = hdr32.timeout, .flags = hdr32.flags, .pack_id = hdr32.pack_id, .usr_ptr = compat_ptr(hdr32.usr_ptr), .status = hdr32.status, .masked_status = hdr32.masked_status, .msg_status = hdr32.msg_status, .sb_len_wr = hdr32.sb_len_wr, .host_status = hdr32.host_status, .driver_status = hdr32.driver_status, .resid = hdr32.resid, .duration = hdr32.duration, .info = hdr32.info, }; return 0; } #endif if (copy_from_user(hdr, argp, sizeof(*hdr))) return -EFAULT; return 0; } EXPORT_SYMBOL(get_sg_io_hdr); #ifdef CONFIG_COMPAT struct compat_cdrom_generic_command { unsigned char cmd[CDROM_PACKET_SIZE]; compat_caddr_t buffer; compat_uint_t buflen; compat_int_t stat; compat_caddr_t sense; unsigned char data_direction; unsigned char pad[3]; compat_int_t quiet; compat_int_t timeout; compat_caddr_t unused; }; #endif static int scsi_get_cdrom_generic_arg(struct cdrom_generic_command *cgc, const void __user *arg) { #ifdef CONFIG_COMPAT if (in_compat_syscall()) { struct compat_cdrom_generic_command cgc32; if (copy_from_user(&cgc32, arg, sizeof(cgc32))) return -EFAULT; *cgc = (struct cdrom_generic_command) { .buffer = compat_ptr(cgc32.buffer), .buflen = cgc32.buflen, .stat = cgc32.stat, .sense = compat_ptr(cgc32.sense), .data_direction = cgc32.data_direction, .quiet = cgc32.quiet, .timeout = cgc32.timeout, .unused = compat_ptr(cgc32.unused), }; memcpy(&cgc->cmd, &cgc32.cmd, CDROM_PACKET_SIZE); return 0; } #endif if (copy_from_user(cgc, arg, sizeof(*cgc))) return -EFAULT; return 0; } static int scsi_put_cdrom_generic_arg(const struct cdrom_generic_command *cgc, void __user *arg) { #ifdef CONFIG_COMPAT if (in_compat_syscall()) { struct compat_cdrom_generic_command cgc32 = { .buffer = (uintptr_t)(cgc->buffer), .buflen = cgc->buflen, .stat = cgc->stat, .sense = (uintptr_t)(cgc->sense), .data_direction = cgc->data_direction, .quiet = cgc->quiet, .timeout = cgc->timeout, .unused = (uintptr_t)(cgc->unused), }; memcpy(&cgc32.cmd, &cgc->cmd, CDROM_PACKET_SIZE); if (copy_to_user(arg, &cgc32, sizeof(cgc32))) return -EFAULT; return 0; } #endif if (copy_to_user(arg, cgc, sizeof(*cgc))) return -EFAULT; return 0; } static int scsi_cdrom_send_packet(struct scsi_device *sdev, bool open_for_write, void __user *arg) { struct cdrom_generic_command cgc; struct sg_io_hdr hdr; int err; err = scsi_get_cdrom_generic_arg(&cgc, arg); if (err) return err; cgc.timeout = clock_t_to_jiffies(cgc.timeout); memset(&hdr, 0, sizeof(hdr)); hdr.interface_id = 'S'; hdr.cmd_len = sizeof(cgc.cmd); hdr.dxfer_len = cgc.buflen; switch (cgc.data_direction) { case CGC_DATA_UNKNOWN: hdr.dxfer_direction = SG_DXFER_UNKNOWN; break; case CGC_DATA_WRITE: hdr.dxfer_direction = SG_DXFER_TO_DEV; break; case CGC_DATA_READ: hdr.dxfer_direction = SG_DXFER_FROM_DEV; break; case CGC_DATA_NONE: hdr.dxfer_direction = SG_DXFER_NONE; break; default: return -EINVAL; } hdr.dxferp = cgc.buffer; hdr.sbp = cgc.sense; if (hdr.sbp) hdr.mx_sb_len = sizeof(struct request_sense); hdr.timeout = jiffies_to_msecs(cgc.timeout); hdr.cmdp = ((struct cdrom_generic_command __user *) arg)->cmd; hdr.cmd_len = sizeof(cgc.cmd); err = sg_io(sdev, &hdr, open_for_write); if (err == -EFAULT) return -EFAULT; if (hdr.status) return -EIO; cgc.stat = err; cgc.buflen = hdr.resid; if (scsi_put_cdrom_generic_arg(&cgc, arg)) return -EFAULT; return err; } static int scsi_ioctl_sg_io(struct scsi_device *sdev, bool open_for_write, void __user *argp) { struct sg_io_hdr hdr; int error; error = get_sg_io_hdr(&hdr, argp); if (error) return error; error = sg_io(sdev, &hdr, open_for_write); if (error == -EFAULT) return error; if (put_sg_io_hdr(&hdr, argp)) return -EFAULT; return error; } /** * scsi_ioctl - Dispatch ioctl to scsi device * @sdev: scsi device receiving ioctl * @open_for_write: is the file / block device opened for writing? * @cmd: which ioctl is it * @arg: data associated with ioctl * * Description: The scsi_ioctl() function differs from most ioctls in that it * does not take a major/minor number as the dev field. Rather, it takes * a pointer to a &struct scsi_device. * * Return: varies depending on the @cmd */ int scsi_ioctl(struct scsi_device *sdev, bool open_for_write, int cmd, void __user *arg) { struct request_queue *q = sdev->request_queue; struct scsi_sense_hdr sense_hdr; /* Check for deprecated ioctls ... all the ioctls which don't * follow the new unique numbering scheme are deprecated */ switch (cmd) { case SCSI_IOCTL_SEND_COMMAND: case SCSI_IOCTL_TEST_UNIT_READY: case SCSI_IOCTL_BENCHMARK_COMMAND: case SCSI_IOCTL_SYNC: case SCSI_IOCTL_START_UNIT: case SCSI_IOCTL_STOP_UNIT: printk(KERN_WARNING "program %s is using a deprecated SCSI " "ioctl, please convert it to SG_IO\n", current->comm); break; default: break; } switch (cmd) { case SG_GET_VERSION_NUM: return sg_get_version(arg); case SG_SET_TIMEOUT: return sg_set_timeout(sdev, arg); case SG_GET_TIMEOUT: return jiffies_to_clock_t(sdev->sg_timeout); case SG_GET_RESERVED_SIZE: return sg_get_reserved_size(sdev, arg); case SG_SET_RESERVED_SIZE: return sg_set_reserved_size(sdev, arg); case SG_EMULATED_HOST: return sg_emulated_host(q, arg); case SG_IO: return scsi_ioctl_sg_io(sdev, open_for_write, arg); case SCSI_IOCTL_SEND_COMMAND: return sg_scsi_ioctl(q, open_for_write, arg); case CDROM_SEND_PACKET: return scsi_cdrom_send_packet(sdev, open_for_write, arg); case CDROMCLOSETRAY: return scsi_send_start_stop(sdev, 3); case CDROMEJECT: return scsi_send_start_stop(sdev, 2); case SCSI_IOCTL_GET_IDLUN: return scsi_get_idlun(sdev, arg); case SCSI_IOCTL_GET_BUS_NUMBER: return put_user(sdev->host->host_no, (int __user *)arg); case SCSI_IOCTL_PROBE_HOST: return ioctl_probe(sdev->host, arg); case SCSI_IOCTL_DOORLOCK: return scsi_set_medium_removal(sdev, SCSI_REMOVAL_PREVENT); case SCSI_IOCTL_DOORUNLOCK: return scsi_set_medium_removal(sdev, SCSI_REMOVAL_ALLOW); case SCSI_IOCTL_TEST_UNIT_READY: return scsi_test_unit_ready(sdev, IOCTL_NORMAL_TIMEOUT, NORMAL_RETRIES, &sense_hdr); case SCSI_IOCTL_START_UNIT: return scsi_send_start_stop(sdev, 1); case SCSI_IOCTL_STOP_UNIT: return scsi_send_start_stop(sdev, 0); case SCSI_IOCTL_GET_PCI: return scsi_ioctl_get_pci(sdev, arg); case SG_SCSI_RESET: return scsi_ioctl_reset(sdev, arg); } #ifdef CONFIG_COMPAT if (in_compat_syscall()) { if (!sdev->host->hostt->compat_ioctl) return -EINVAL; return sdev->host->hostt->compat_ioctl(sdev, cmd, arg); } #endif if (!sdev->host->hostt->ioctl) return -EINVAL; return sdev->host->hostt->ioctl(sdev, cmd, arg); } EXPORT_SYMBOL(scsi_ioctl); /** * scsi_ioctl_block_when_processing_errors - prevent commands from being queued * @sdev: target scsi device * @cmd: which ioctl is it * @ndelay: no delay (non-blocking) * * We can process a reset even when a device isn't fully operable. * * Return: %0 on success, <0 error code. */ int scsi_ioctl_block_when_processing_errors(struct scsi_device *sdev, int cmd, bool ndelay) { if (cmd == SG_SCSI_RESET && ndelay) { if (scsi_host_in_recovery(sdev->host)) return -EAGAIN; } else { if (!scsi_block_when_processing_errors(sdev)) return -ENODEV; } return 0; } EXPORT_SYMBOL_GPL(scsi_ioctl_block_when_processing_errors); |
| 78 78 1 77 78 4 21 149 149 33 33 140 140 140 54 13 2 4 37 36 2 38 37 38 38 1 38 4 2 72 72 82 82 82 68 14 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ #include "queueing.h" #include "socket.h" #include "timers.h" #include "device.h" #include "ratelimiter.h" #include "peer.h" #include "messages.h" #include <linux/module.h> #include <linux/rtnetlink.h> #include <linux/inet.h> #include <linux/netdevice.h> #include <linux/inetdevice.h> #include <linux/if_arp.h> #include <linux/icmp.h> #include <linux/suspend.h> #include <net/dst_metadata.h> #include <net/gso.h> #include <net/icmp.h> #include <net/rtnetlink.h> #include <net/ip_tunnels.h> #include <net/addrconf.h> static LIST_HEAD(device_list); static int wg_open(struct net_device *dev) { struct in_device *dev_v4 = __in_dev_get_rtnl(dev); struct inet6_dev *dev_v6 = __in6_dev_get(dev); struct wg_device *wg = netdev_priv(dev); struct wg_peer *peer; int ret; if (dev_v4) { /* At some point we might put this check near the ip_rt_send_ * redirect call of ip_forward in net/ipv4/ip_forward.c, similar * to the current secpath check. */ IN_DEV_CONF_SET(dev_v4, SEND_REDIRECTS, false); IPV4_DEVCONF_ALL(dev_net(dev), SEND_REDIRECTS) = false; } if (dev_v6) dev_v6->cnf.addr_gen_mode = IN6_ADDR_GEN_MODE_NONE; mutex_lock(&wg->device_update_lock); ret = wg_socket_init(wg, wg->incoming_port); if (ret < 0) goto out; list_for_each_entry(peer, &wg->peer_list, peer_list) { wg_packet_send_staged_packets(peer); if (peer->persistent_keepalive_interval) wg_packet_send_keepalive(peer); } out: mutex_unlock(&wg->device_update_lock); return ret; } static int wg_pm_notification(struct notifier_block *nb, unsigned long action, void *data) { struct wg_device *wg; struct wg_peer *peer; /* If the machine is constantly suspending and resuming, as part of * its normal operation rather than as a somewhat rare event, then we * don't actually want to clear keys. */ if (IS_ENABLED(CONFIG_PM_AUTOSLEEP) || IS_ENABLED(CONFIG_PM_USERSPACE_AUTOSLEEP)) return 0; if (action != PM_HIBERNATION_PREPARE && action != PM_SUSPEND_PREPARE) return 0; rtnl_lock(); list_for_each_entry(wg, &device_list, device_list) { mutex_lock(&wg->device_update_lock); list_for_each_entry(peer, &wg->peer_list, peer_list) { timer_delete(&peer->timer_zero_key_material); wg_noise_handshake_clear(&peer->handshake); wg_noise_keypairs_clear(&peer->keypairs); } mutex_unlock(&wg->device_update_lock); } rtnl_unlock(); rcu_barrier(); return 0; } static struct notifier_block pm_notifier = { .notifier_call = wg_pm_notification }; static int wg_vm_notification(struct notifier_block *nb, unsigned long action, void *data) { struct wg_device *wg; struct wg_peer *peer; rtnl_lock(); list_for_each_entry(wg, &device_list, device_list) { mutex_lock(&wg->device_update_lock); list_for_each_entry(peer, &wg->peer_list, peer_list) wg_noise_expire_current_peer_keypairs(peer); mutex_unlock(&wg->device_update_lock); } rtnl_unlock(); return 0; } static struct notifier_block vm_notifier = { .notifier_call = wg_vm_notification }; static int wg_stop(struct net_device *dev) { struct wg_device *wg = netdev_priv(dev); struct wg_peer *peer; struct sk_buff *skb; mutex_lock(&wg->device_update_lock); list_for_each_entry(peer, &wg->peer_list, peer_list) { wg_packet_purge_staged_packets(peer); wg_timers_stop(peer); wg_noise_handshake_clear(&peer->handshake); wg_noise_keypairs_clear(&peer->keypairs); wg_noise_reset_last_sent_handshake(&peer->last_sent_handshake); } mutex_unlock(&wg->device_update_lock); while ((skb = ptr_ring_consume(&wg->handshake_queue.ring)) != NULL) kfree_skb(skb); atomic_set(&wg->handshake_queue_len, 0); wg_socket_reinit(wg, NULL, NULL); return 0; } static netdev_tx_t wg_xmit(struct sk_buff *skb, struct net_device *dev) { struct wg_device *wg = netdev_priv(dev); struct sk_buff_head packets; struct wg_peer *peer; struct sk_buff *next; sa_family_t family; u32 mtu; int ret; if (unlikely(!wg_check_packet_protocol(skb))) { ret = -EPROTONOSUPPORT; net_dbg_ratelimited("%s: Invalid IP packet\n", dev->name); goto err; } peer = wg_allowedips_lookup_dst(&wg->peer_allowedips, skb); if (unlikely(!peer)) { ret = -ENOKEY; if (skb->protocol == htons(ETH_P_IP)) net_dbg_ratelimited("%s: No peer has allowed IPs matching %pI4\n", dev->name, &ip_hdr(skb)->daddr); else if (skb->protocol == htons(ETH_P_IPV6)) net_dbg_ratelimited("%s: No peer has allowed IPs matching %pI6\n", dev->name, &ipv6_hdr(skb)->daddr); goto err_icmp; } family = READ_ONCE(peer->endpoint.addr.sa_family); if (unlikely(family != AF_INET && family != AF_INET6)) { ret = -EDESTADDRREQ; net_dbg_ratelimited("%s: No valid endpoint has been configured or discovered for peer %llu\n", dev->name, peer->internal_id); goto err_peer; } mtu = skb_valid_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu; __skb_queue_head_init(&packets); if (!skb_is_gso(skb)) { skb_mark_not_on_list(skb); } else { struct sk_buff *segs = skb_gso_segment(skb, 0); if (IS_ERR(segs)) { ret = PTR_ERR(segs); goto err_peer; } dev_kfree_skb(skb); skb = segs; } skb_list_walk_safe(skb, skb, next) { skb_mark_not_on_list(skb); skb = skb_share_check(skb, GFP_ATOMIC); if (unlikely(!skb)) continue; /* We only need to keep the original dst around for icmp, * so at this point we're in a position to drop it. */ skb_dst_drop(skb); PACKET_CB(skb)->mtu = mtu; __skb_queue_tail(&packets, skb); } spin_lock_bh(&peer->staged_packet_queue.lock); /* If the queue is getting too big, we start removing the oldest packets * until it's small again. We do this before adding the new packet, so * we don't remove GSO segments that are in excess. */ while (skb_queue_len(&peer->staged_packet_queue) > MAX_STAGED_PACKETS) { dev_kfree_skb(__skb_dequeue(&peer->staged_packet_queue)); DEV_STATS_INC(dev, tx_dropped); } skb_queue_splice_tail(&packets, &peer->staged_packet_queue); spin_unlock_bh(&peer->staged_packet_queue.lock); wg_packet_send_staged_packets(peer); wg_peer_put(peer); return NETDEV_TX_OK; err_peer: wg_peer_put(peer); err_icmp: if (skb->protocol == htons(ETH_P_IP)) icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0); else if (skb->protocol == htons(ETH_P_IPV6)) icmpv6_ndo_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_ADDR_UNREACH, 0); err: DEV_STATS_INC(dev, tx_errors); kfree_skb(skb); return ret; } static const struct net_device_ops netdev_ops = { .ndo_open = wg_open, .ndo_stop = wg_stop, .ndo_start_xmit = wg_xmit, }; static void wg_destruct(struct net_device *dev) { struct wg_device *wg = netdev_priv(dev); rtnl_lock(); list_del(&wg->device_list); rtnl_unlock(); mutex_lock(&wg->device_update_lock); rcu_assign_pointer(wg->creating_net, NULL); wg->incoming_port = 0; wg_socket_reinit(wg, NULL, NULL); /* The final references are cleared in the below calls to destroy_workqueue. */ wg_peer_remove_all(wg); destroy_workqueue(wg->handshake_receive_wq); destroy_workqueue(wg->handshake_send_wq); destroy_workqueue(wg->packet_crypt_wq); wg_packet_queue_free(&wg->handshake_queue, true); wg_packet_queue_free(&wg->decrypt_queue, false); wg_packet_queue_free(&wg->encrypt_queue, false); rcu_barrier(); /* Wait for all the peers to be actually freed. */ wg_ratelimiter_uninit(); memzero_explicit(&wg->static_identity, sizeof(wg->static_identity)); kvfree(wg->index_hashtable); kvfree(wg->peer_hashtable); mutex_unlock(&wg->device_update_lock); pr_debug("%s: Interface destroyed\n", dev->name); free_netdev(dev); } static const struct device_type device_type = { .name = KBUILD_MODNAME }; static void wg_setup(struct net_device *dev) { struct wg_device *wg = netdev_priv(dev); enum { WG_NETDEV_FEATURES = NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_SG | NETIF_F_GSO | NETIF_F_GSO_SOFTWARE | NETIF_F_HIGHDMA }; const int overhead = MESSAGE_MINIMUM_LENGTH + sizeof(struct udphdr) + max(sizeof(struct ipv6hdr), sizeof(struct iphdr)); dev->netdev_ops = &netdev_ops; dev->header_ops = &ip_tunnel_header_ops; dev->hard_header_len = 0; dev->addr_len = 0; dev->needed_headroom = DATA_PACKET_HEAD_ROOM; dev->needed_tailroom = noise_encrypted_len(MESSAGE_PADDING_MULTIPLE); dev->type = ARPHRD_NONE; dev->flags = IFF_POINTOPOINT | IFF_NOARP; dev->priv_flags |= IFF_NO_QUEUE; dev->lltx = true; dev->features |= WG_NETDEV_FEATURES; dev->hw_features |= WG_NETDEV_FEATURES; dev->hw_enc_features |= WG_NETDEV_FEATURES; dev->mtu = ETH_DATA_LEN - overhead; dev->max_mtu = round_down(INT_MAX, MESSAGE_PADDING_MULTIPLE) - overhead; dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS; SET_NETDEV_DEVTYPE(dev, &device_type); /* We need to keep the dst around in case of icmp replies. */ netif_keep_dst(dev); netif_set_tso_max_size(dev, GSO_MAX_SIZE); wg->dev = dev; } static int wg_newlink(struct net_device *dev, struct rtnl_newlink_params *params, struct netlink_ext_ack *extack) { struct net *link_net = rtnl_newlink_link_net(params); struct wg_device *wg = netdev_priv(dev); int ret = -ENOMEM; rcu_assign_pointer(wg->creating_net, link_net); init_rwsem(&wg->static_identity.lock); mutex_init(&wg->socket_update_lock); mutex_init(&wg->device_update_lock); wg_allowedips_init(&wg->peer_allowedips); wg_cookie_checker_init(&wg->cookie_checker, wg); INIT_LIST_HEAD(&wg->peer_list); wg->device_update_gen = 1; wg->peer_hashtable = wg_pubkey_hashtable_alloc(); if (!wg->peer_hashtable) return ret; wg->index_hashtable = wg_index_hashtable_alloc(); if (!wg->index_hashtable) goto err_free_peer_hashtable; wg->handshake_receive_wq = alloc_workqueue("wg-kex-%s", WQ_CPU_INTENSIVE | WQ_FREEZABLE, 0, dev->name); if (!wg->handshake_receive_wq) goto err_free_index_hashtable; wg->handshake_send_wq = alloc_workqueue("wg-kex-%s", WQ_UNBOUND | WQ_FREEZABLE, 0, dev->name); if (!wg->handshake_send_wq) goto err_destroy_handshake_receive; wg->packet_crypt_wq = alloc_workqueue("wg-crypt-%s", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 0, dev->name); if (!wg->packet_crypt_wq) goto err_destroy_handshake_send; ret = wg_packet_queue_init(&wg->encrypt_queue, wg_packet_encrypt_worker, MAX_QUEUED_PACKETS); if (ret < 0) goto err_destroy_packet_crypt; ret = wg_packet_queue_init(&wg->decrypt_queue, wg_packet_decrypt_worker, MAX_QUEUED_PACKETS); if (ret < 0) goto err_free_encrypt_queue; ret = wg_packet_queue_init(&wg->handshake_queue, wg_packet_handshake_receive_worker, MAX_QUEUED_INCOMING_HANDSHAKES); if (ret < 0) goto err_free_decrypt_queue; ret = wg_ratelimiter_init(); if (ret < 0) goto err_free_handshake_queue; netif_threaded_enable(dev); ret = register_netdevice(dev); if (ret < 0) goto err_uninit_ratelimiter; list_add(&wg->device_list, &device_list); /* We wait until the end to assign priv_destructor, so that * register_netdevice doesn't call it for us if it fails. */ dev->priv_destructor = wg_destruct; pr_debug("%s: Interface created\n", dev->name); return ret; err_uninit_ratelimiter: wg_ratelimiter_uninit(); err_free_handshake_queue: wg_packet_queue_free(&wg->handshake_queue, false); err_free_decrypt_queue: wg_packet_queue_free(&wg->decrypt_queue, false); err_free_encrypt_queue: wg_packet_queue_free(&wg->encrypt_queue, false); err_destroy_packet_crypt: destroy_workqueue(wg->packet_crypt_wq); err_destroy_handshake_send: destroy_workqueue(wg->handshake_send_wq); err_destroy_handshake_receive: destroy_workqueue(wg->handshake_receive_wq); err_free_index_hashtable: kvfree(wg->index_hashtable); err_free_peer_hashtable: kvfree(wg->peer_hashtable); return ret; } static struct rtnl_link_ops link_ops __read_mostly = { .kind = KBUILD_MODNAME, .priv_size = sizeof(struct wg_device), .setup = wg_setup, .newlink = wg_newlink, }; static void wg_netns_pre_exit(struct net *net) { struct wg_device *wg; struct wg_peer *peer; rtnl_lock(); list_for_each_entry(wg, &device_list, device_list) { if (rcu_access_pointer(wg->creating_net) == net) { pr_debug("%s: Creating namespace exiting\n", wg->dev->name); netif_carrier_off(wg->dev); mutex_lock(&wg->device_update_lock); rcu_assign_pointer(wg->creating_net, NULL); wg_socket_reinit(wg, NULL, NULL); list_for_each_entry(peer, &wg->peer_list, peer_list) wg_socket_clear_peer_endpoint_src(peer); mutex_unlock(&wg->device_update_lock); } } rtnl_unlock(); } static struct pernet_operations pernet_ops = { .pre_exit = wg_netns_pre_exit }; int __init wg_device_init(void) { int ret; ret = register_pm_notifier(&pm_notifier); if (ret) return ret; ret = register_random_vmfork_notifier(&vm_notifier); if (ret) goto error_pm; ret = register_pernet_device(&pernet_ops); if (ret) goto error_vm; ret = rtnl_link_register(&link_ops); if (ret) goto error_pernet; return 0; error_pernet: unregister_pernet_device(&pernet_ops); error_vm: unregister_random_vmfork_notifier(&vm_notifier); error_pm: unregister_pm_notifier(&pm_notifier); return ret; } void wg_device_uninit(void) { rtnl_link_unregister(&link_ops); unregister_pernet_device(&pernet_ops); unregister_random_vmfork_notifier(&vm_notifier); unregister_pm_notifier(&pm_notifier); rcu_barrier(); } |
| 194 125 125 125 68 68 152 114 114 122 122 50 82 40 79 82 122 126 126 14 110 106 16 2 82 40 12 85 104 104 97 72 54 5 5 63 1 3 2 57 4 1 1 2 1 42 21 19 1 1 1 1 4 2 2 4 4 10 1 1 4 4 9 2 2 5 18 2 16 16 16 16 14 1 8 1 3 50 50 3 50 2 2 2 50 4 4 11 9 2 1 1 1 12 1 5 65 5 43 1 9 19 50 4 3 10 3 1 1 1 237 237 2 237 233 144 99 21 234 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 | /* * Copyright (c) 2014, Ericsson AB * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the names of the copyright holders nor the names of its * contributors may be used to endorse or promote products derived from * this software without specific prior written permission. * * Alternatively, this software may be distributed under the terms of the * GNU General Public License ("GPL") version 2 as published by the Free * Software Foundation. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ #include "core.h" #include "bearer.h" #include "link.h" #include "name_table.h" #include "socket.h" #include "node.h" #include "net.h" #include <net/genetlink.h> #include <linux/string_helpers.h> #include <linux/tipc_config.h> /* The legacy API had an artificial message length limit called * ULTRA_STRING_MAX_LEN. */ #define ULTRA_STRING_MAX_LEN 32768 #define TIPC_SKB_MAX TLV_SPACE(ULTRA_STRING_MAX_LEN) #define REPLY_TRUNCATED "<truncated>\n" struct tipc_nl_compat_msg { u16 cmd; int rep_type; int rep_size; int req_type; int req_size; struct net *net; struct sk_buff *rep; struct tlv_desc *req; struct sock *dst_sk; }; struct tipc_nl_compat_cmd_dump { int (*header)(struct tipc_nl_compat_msg *); int (*dumpit)(struct sk_buff *, struct netlink_callback *); int (*format)(struct tipc_nl_compat_msg *msg, struct nlattr **attrs); }; struct tipc_nl_compat_cmd_doit { int (*doit)(struct sk_buff *skb, struct genl_info *info); int (*transcode)(struct tipc_nl_compat_cmd_doit *cmd, struct sk_buff *skb, struct tipc_nl_compat_msg *msg); }; static int tipc_skb_tailroom(struct sk_buff *skb) { int tailroom; int limit; tailroom = skb_tailroom(skb); limit = TIPC_SKB_MAX - skb->len; if (tailroom < limit) return tailroom; return limit; } static inline int TLV_GET_DATA_LEN(struct tlv_desc *tlv) { return TLV_GET_LEN(tlv) - TLV_SPACE(0); } static int tipc_add_tlv(struct sk_buff *skb, u16 type, void *data, u16 len) { struct tlv_desc *tlv = (struct tlv_desc *)skb_tail_pointer(skb); if (tipc_skb_tailroom(skb) < TLV_SPACE(len)) return -EMSGSIZE; skb_put(skb, TLV_SPACE(len)); memset(tlv, 0, TLV_SPACE(len)); tlv->tlv_type = htons(type); tlv->tlv_len = htons(TLV_LENGTH(len)); if (len && data) memcpy(TLV_DATA(tlv), data, len); return 0; } static void tipc_tlv_init(struct sk_buff *skb, u16 type) { struct tlv_desc *tlv = (struct tlv_desc *)skb->data; TLV_SET_LEN(tlv, 0); TLV_SET_TYPE(tlv, type); skb_put(skb, sizeof(struct tlv_desc)); } static __printf(2, 3) int tipc_tlv_sprintf(struct sk_buff *skb, const char *fmt, ...) { int n; u16 len; u32 rem; char *buf; struct tlv_desc *tlv; va_list args; rem = tipc_skb_tailroom(skb); tlv = (struct tlv_desc *)skb->data; len = TLV_GET_LEN(tlv); buf = TLV_DATA(tlv) + len; va_start(args, fmt); n = vscnprintf(buf, rem, fmt, args); va_end(args); TLV_SET_LEN(tlv, n + len); skb_put(skb, n); return n; } static struct sk_buff *tipc_tlv_alloc(int size) { int hdr_len; struct sk_buff *buf; size = TLV_SPACE(size); hdr_len = nlmsg_total_size(GENL_HDRLEN + TIPC_GENL_HDRLEN); buf = alloc_skb(hdr_len + size, GFP_KERNEL); if (!buf) return NULL; skb_reserve(buf, hdr_len); return buf; } static struct sk_buff *tipc_get_err_tlv(char *str) { int str_len = strlen(str) + 1; struct sk_buff *buf; buf = tipc_tlv_alloc(str_len); if (buf) tipc_add_tlv(buf, TIPC_TLV_ERROR_STRING, str, str_len); return buf; } static int __tipc_nl_compat_dumpit(struct tipc_nl_compat_cmd_dump *cmd, struct tipc_nl_compat_msg *msg, struct sk_buff *arg) { struct genl_dumpit_info info; int len = 0; int err; struct sk_buff *buf; struct nlmsghdr *nlmsg; struct netlink_callback cb; struct nlattr **attrbuf; memset(&cb, 0, sizeof(cb)); cb.nlh = (struct nlmsghdr *)arg->data; cb.skb = arg; cb.data = &info; buf = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (!buf) return -ENOMEM; buf->sk = msg->dst_sk; if (__tipc_dump_start(&cb, msg->net)) { kfree_skb(buf); return -ENOMEM; } attrbuf = kcalloc(tipc_genl_family.maxattr + 1, sizeof(struct nlattr *), GFP_KERNEL); if (!attrbuf) { err = -ENOMEM; goto err_out; } info.info.attrs = attrbuf; if (nlmsg_len(cb.nlh) > 0) { err = nlmsg_parse_deprecated(cb.nlh, GENL_HDRLEN, attrbuf, tipc_genl_family.maxattr, tipc_genl_family.policy, NULL); if (err) goto err_out; } do { int rem; len = (*cmd->dumpit)(buf, &cb); nlmsg_for_each_msg(nlmsg, nlmsg_hdr(buf), len, rem) { err = nlmsg_parse_deprecated(nlmsg, GENL_HDRLEN, attrbuf, tipc_genl_family.maxattr, tipc_genl_family.policy, NULL); if (err) goto err_out; err = (*cmd->format)(msg, attrbuf); if (err) goto err_out; if (tipc_skb_tailroom(msg->rep) <= 1) { err = -EMSGSIZE; goto err_out; } } skb_reset_tail_pointer(buf); buf->len = 0; } while (len); err = 0; err_out: kfree(attrbuf); tipc_dump_done(&cb); kfree_skb(buf); if (err == -EMSGSIZE) { /* The legacy API only considered messages filling * "ULTRA_STRING_MAX_LEN" to be truncated. */ if ((TIPC_SKB_MAX - msg->rep->len) <= 1) { char *tail = skb_tail_pointer(msg->rep); if (*tail != '\0') sprintf(tail - sizeof(REPLY_TRUNCATED) - 1, REPLY_TRUNCATED); } return 0; } return err; } static int tipc_nl_compat_dumpit(struct tipc_nl_compat_cmd_dump *cmd, struct tipc_nl_compat_msg *msg) { struct nlmsghdr *nlh; struct sk_buff *arg; int err; if (msg->req_type && (!msg->req_size || !TLV_CHECK_TYPE(msg->req, msg->req_type))) return -EINVAL; msg->rep = tipc_tlv_alloc(msg->rep_size); if (!msg->rep) return -ENOMEM; if (msg->rep_type) tipc_tlv_init(msg->rep, msg->rep_type); if (cmd->header) { err = (*cmd->header)(msg); if (err) { kfree_skb(msg->rep); msg->rep = NULL; return err; } } arg = nlmsg_new(0, GFP_KERNEL); if (!arg) { kfree_skb(msg->rep); msg->rep = NULL; return -ENOMEM; } nlh = nlmsg_put(arg, 0, 0, tipc_genl_family.id, 0, NLM_F_MULTI); if (!nlh) { kfree_skb(arg); kfree_skb(msg->rep); msg->rep = NULL; return -EMSGSIZE; } nlmsg_end(arg, nlh); err = __tipc_nl_compat_dumpit(cmd, msg, arg); if (err) { kfree_skb(msg->rep); msg->rep = NULL; } kfree_skb(arg); return err; } static int __tipc_nl_compat_doit(struct tipc_nl_compat_cmd_doit *cmd, struct tipc_nl_compat_msg *msg) { int err; struct sk_buff *doit_buf; struct sk_buff *trans_buf; struct nlattr **attrbuf; struct genl_info info; trans_buf = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL); if (!trans_buf) return -ENOMEM; attrbuf = kmalloc_array(tipc_genl_family.maxattr + 1, sizeof(struct nlattr *), GFP_KERNEL); if (!attrbuf) { err = -ENOMEM; goto trans_out; } doit_buf = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL); if (!doit_buf) { err = -ENOMEM; goto attrbuf_out; } memset(&info, 0, sizeof(info)); info.attrs = attrbuf; rtnl_lock(); err = (*cmd->transcode)(cmd, trans_buf, msg); if (err) goto doit_out; err = nla_parse_deprecated(attrbuf, tipc_genl_family.maxattr, (const struct nlattr *)trans_buf->data, trans_buf->len, NULL, NULL); if (err) goto doit_out; doit_buf->sk = msg->dst_sk; err = (*cmd->doit)(doit_buf, &info); doit_out: rtnl_unlock(); kfree_skb(doit_buf); attrbuf_out: kfree(attrbuf); trans_out: kfree_skb(trans_buf); return err; } static int tipc_nl_compat_doit(struct tipc_nl_compat_cmd_doit *cmd, struct tipc_nl_compat_msg *msg) { int err; if (msg->req_type && (!msg->req_size || !TLV_CHECK_TYPE(msg->req, msg->req_type))) return -EINVAL; err = __tipc_nl_compat_doit(cmd, msg); if (err) return err; /* The legacy API considered an empty message a success message */ msg->rep = tipc_tlv_alloc(0); if (!msg->rep) return -ENOMEM; return 0; } static int tipc_nl_compat_bearer_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { struct nlattr *bearer[TIPC_NLA_BEARER_MAX + 1]; int err; if (!attrs[TIPC_NLA_BEARER]) return -EINVAL; err = nla_parse_nested_deprecated(bearer, TIPC_NLA_BEARER_MAX, attrs[TIPC_NLA_BEARER], NULL, NULL); if (err) return err; return tipc_add_tlv(msg->rep, TIPC_TLV_BEARER_NAME, nla_data(bearer[TIPC_NLA_BEARER_NAME]), nla_len(bearer[TIPC_NLA_BEARER_NAME])); } static int tipc_nl_compat_bearer_enable(struct tipc_nl_compat_cmd_doit *cmd, struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { struct nlattr *prop; struct nlattr *bearer; struct tipc_bearer_config *b; int len; b = (struct tipc_bearer_config *)TLV_DATA(msg->req); bearer = nla_nest_start_noflag(skb, TIPC_NLA_BEARER); if (!bearer) return -EMSGSIZE; len = TLV_GET_DATA_LEN(msg->req); len -= offsetof(struct tipc_bearer_config, name); if (len <= 0) return -EINVAL; len = min_t(int, len, TIPC_MAX_BEARER_NAME); if (!string_is_terminated(b->name, len)) return -EINVAL; if (nla_put_string(skb, TIPC_NLA_BEARER_NAME, b->name)) return -EMSGSIZE; if (nla_put_u32(skb, TIPC_NLA_BEARER_DOMAIN, ntohl(b->disc_domain))) return -EMSGSIZE; if (ntohl(b->priority) <= TIPC_MAX_LINK_PRI) { prop = nla_nest_start_noflag(skb, TIPC_NLA_BEARER_PROP); if (!prop) return -EMSGSIZE; if (nla_put_u32(skb, TIPC_NLA_PROP_PRIO, ntohl(b->priority))) return -EMSGSIZE; nla_nest_end(skb, prop); } nla_nest_end(skb, bearer); return 0; } static int tipc_nl_compat_bearer_disable(struct tipc_nl_compat_cmd_doit *cmd, struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { char *name; struct nlattr *bearer; int len; name = (char *)TLV_DATA(msg->req); bearer = nla_nest_start_noflag(skb, TIPC_NLA_BEARER); if (!bearer) return -EMSGSIZE; len = TLV_GET_DATA_LEN(msg->req); if (len <= 0) return -EINVAL; len = min_t(int, len, TIPC_MAX_BEARER_NAME); if (!string_is_terminated(name, len)) return -EINVAL; if (nla_put_string(skb, TIPC_NLA_BEARER_NAME, name)) return -EMSGSIZE; nla_nest_end(skb, bearer); return 0; } static inline u32 perc(u32 count, u32 total) { return (count * 100 + (total / 2)) / total; } static void __fill_bc_link_stat(struct tipc_nl_compat_msg *msg, struct nlattr *prop[], struct nlattr *stats[]) { tipc_tlv_sprintf(msg->rep, " Window:%u packets\n", nla_get_u32(prop[TIPC_NLA_PROP_WIN])); tipc_tlv_sprintf(msg->rep, " RX packets:%u fragments:%u/%u bundles:%u/%u\n", nla_get_u32(stats[TIPC_NLA_STATS_RX_INFO]), nla_get_u32(stats[TIPC_NLA_STATS_RX_FRAGMENTS]), nla_get_u32(stats[TIPC_NLA_STATS_RX_FRAGMENTED]), nla_get_u32(stats[TIPC_NLA_STATS_RX_BUNDLES]), nla_get_u32(stats[TIPC_NLA_STATS_RX_BUNDLED])); tipc_tlv_sprintf(msg->rep, " TX packets:%u fragments:%u/%u bundles:%u/%u\n", nla_get_u32(stats[TIPC_NLA_STATS_TX_INFO]), nla_get_u32(stats[TIPC_NLA_STATS_TX_FRAGMENTS]), nla_get_u32(stats[TIPC_NLA_STATS_TX_FRAGMENTED]), nla_get_u32(stats[TIPC_NLA_STATS_TX_BUNDLES]), nla_get_u32(stats[TIPC_NLA_STATS_TX_BUNDLED])); tipc_tlv_sprintf(msg->rep, " RX naks:%u defs:%u dups:%u\n", nla_get_u32(stats[TIPC_NLA_STATS_RX_NACKS]), nla_get_u32(stats[TIPC_NLA_STATS_RX_DEFERRED]), nla_get_u32(stats[TIPC_NLA_STATS_DUPLICATES])); tipc_tlv_sprintf(msg->rep, " TX naks:%u acks:%u dups:%u\n", nla_get_u32(stats[TIPC_NLA_STATS_TX_NACKS]), nla_get_u32(stats[TIPC_NLA_STATS_TX_ACKS]), nla_get_u32(stats[TIPC_NLA_STATS_RETRANSMITTED])); tipc_tlv_sprintf(msg->rep, " Congestion link:%u Send queue max:%u avg:%u", nla_get_u32(stats[TIPC_NLA_STATS_LINK_CONGS]), nla_get_u32(stats[TIPC_NLA_STATS_MAX_QUEUE]), nla_get_u32(stats[TIPC_NLA_STATS_AVG_QUEUE])); } static int tipc_nl_compat_link_stat_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { char *name; struct nlattr *link[TIPC_NLA_LINK_MAX + 1]; struct nlattr *prop[TIPC_NLA_PROP_MAX + 1]; struct nlattr *stats[TIPC_NLA_STATS_MAX + 1]; int err; int len; if (!attrs[TIPC_NLA_LINK]) return -EINVAL; err = nla_parse_nested_deprecated(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK], NULL, NULL); if (err) return err; if (!link[TIPC_NLA_LINK_PROP]) return -EINVAL; err = nla_parse_nested_deprecated(prop, TIPC_NLA_PROP_MAX, link[TIPC_NLA_LINK_PROP], NULL, NULL); if (err) return err; if (!link[TIPC_NLA_LINK_STATS]) return -EINVAL; err = nla_parse_nested_deprecated(stats, TIPC_NLA_STATS_MAX, link[TIPC_NLA_LINK_STATS], NULL, NULL); if (err) return err; name = (char *)TLV_DATA(msg->req); len = TLV_GET_DATA_LEN(msg->req); if (len <= 0) return -EINVAL; len = min_t(int, len, TIPC_MAX_LINK_NAME); if (!string_is_terminated(name, len)) return -EINVAL; if (strcmp(name, nla_data(link[TIPC_NLA_LINK_NAME])) != 0) return 0; tipc_tlv_sprintf(msg->rep, "\nLink <%s>\n", (char *)nla_data(link[TIPC_NLA_LINK_NAME])); if (link[TIPC_NLA_LINK_BROADCAST]) { __fill_bc_link_stat(msg, prop, stats); return 0; } if (link[TIPC_NLA_LINK_ACTIVE]) tipc_tlv_sprintf(msg->rep, " ACTIVE"); else if (link[TIPC_NLA_LINK_UP]) tipc_tlv_sprintf(msg->rep, " STANDBY"); else tipc_tlv_sprintf(msg->rep, " DEFUNCT"); tipc_tlv_sprintf(msg->rep, " MTU:%u Priority:%u", nla_get_u32(link[TIPC_NLA_LINK_MTU]), nla_get_u32(prop[TIPC_NLA_PROP_PRIO])); tipc_tlv_sprintf(msg->rep, " Tolerance:%u ms Window:%u packets\n", nla_get_u32(prop[TIPC_NLA_PROP_TOL]), nla_get_u32(prop[TIPC_NLA_PROP_WIN])); tipc_tlv_sprintf(msg->rep, " RX packets:%u fragments:%u/%u bundles:%u/%u\n", nla_get_u32(link[TIPC_NLA_LINK_RX]) - nla_get_u32(stats[TIPC_NLA_STATS_RX_INFO]), nla_get_u32(stats[TIPC_NLA_STATS_RX_FRAGMENTS]), nla_get_u32(stats[TIPC_NLA_STATS_RX_FRAGMENTED]), nla_get_u32(stats[TIPC_NLA_STATS_RX_BUNDLES]), nla_get_u32(stats[TIPC_NLA_STATS_RX_BUNDLED])); tipc_tlv_sprintf(msg->rep, " TX packets:%u fragments:%u/%u bundles:%u/%u\n", nla_get_u32(link[TIPC_NLA_LINK_TX]) - nla_get_u32(stats[TIPC_NLA_STATS_TX_INFO]), nla_get_u32(stats[TIPC_NLA_STATS_TX_FRAGMENTS]), nla_get_u32(stats[TIPC_NLA_STATS_TX_FRAGMENTED]), nla_get_u32(stats[TIPC_NLA_STATS_TX_BUNDLES]), nla_get_u32(stats[TIPC_NLA_STATS_TX_BUNDLED])); tipc_tlv_sprintf(msg->rep, " TX profile sample:%u packets average:%u octets\n", nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_CNT]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_TOT]) / nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT])); tipc_tlv_sprintf(msg->rep, " 0-64:%u%% -256:%u%% -1024:%u%% -4096:%u%% ", perc(nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_P0]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT])), perc(nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_P1]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT])), perc(nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_P2]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT])), perc(nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_P3]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT]))); tipc_tlv_sprintf(msg->rep, "-16384:%u%% -32768:%u%% -66000:%u%%\n", perc(nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_P4]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT])), perc(nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_P5]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT])), perc(nla_get_u32(stats[TIPC_NLA_STATS_MSG_LEN_P6]), nla_get_u32(stats[TIPC_NLA_STATS_MSG_PROF_TOT]))); tipc_tlv_sprintf(msg->rep, " RX states:%u probes:%u naks:%u defs:%u dups:%u\n", nla_get_u32(stats[TIPC_NLA_STATS_RX_STATES]), nla_get_u32(stats[TIPC_NLA_STATS_RX_PROBES]), nla_get_u32(stats[TIPC_NLA_STATS_RX_NACKS]), nla_get_u32(stats[TIPC_NLA_STATS_RX_DEFERRED]), nla_get_u32(stats[TIPC_NLA_STATS_DUPLICATES])); tipc_tlv_sprintf(msg->rep, " TX states:%u probes:%u naks:%u acks:%u dups:%u\n", nla_get_u32(stats[TIPC_NLA_STATS_TX_STATES]), nla_get_u32(stats[TIPC_NLA_STATS_TX_PROBES]), nla_get_u32(stats[TIPC_NLA_STATS_TX_NACKS]), nla_get_u32(stats[TIPC_NLA_STATS_TX_ACKS]), nla_get_u32(stats[TIPC_NLA_STATS_RETRANSMITTED])); tipc_tlv_sprintf(msg->rep, " Congestion link:%u Send queue max:%u avg:%u", nla_get_u32(stats[TIPC_NLA_STATS_LINK_CONGS]), nla_get_u32(stats[TIPC_NLA_STATS_MAX_QUEUE]), nla_get_u32(stats[TIPC_NLA_STATS_AVG_QUEUE])); return 0; } static int tipc_nl_compat_link_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { struct nlattr *link[TIPC_NLA_LINK_MAX + 1]; struct tipc_link_info link_info; int err; if (!attrs[TIPC_NLA_LINK]) return -EINVAL; err = nla_parse_nested_deprecated(link, TIPC_NLA_LINK_MAX, attrs[TIPC_NLA_LINK], NULL, NULL); if (err) return err; link_info.dest = htonl(nla_get_flag(link[TIPC_NLA_LINK_DEST])); link_info.up = htonl(nla_get_flag(link[TIPC_NLA_LINK_UP])); nla_strscpy(link_info.str, link[TIPC_NLA_LINK_NAME], TIPC_MAX_LINK_NAME); return tipc_add_tlv(msg->rep, TIPC_TLV_LINK_INFO, &link_info, sizeof(link_info)); } static int __tipc_add_link_prop(struct sk_buff *skb, struct tipc_nl_compat_msg *msg, struct tipc_link_config *lc) { switch (msg->cmd) { case TIPC_CMD_SET_LINK_PRI: return nla_put_u32(skb, TIPC_NLA_PROP_PRIO, ntohl(lc->value)); case TIPC_CMD_SET_LINK_TOL: return nla_put_u32(skb, TIPC_NLA_PROP_TOL, ntohl(lc->value)); case TIPC_CMD_SET_LINK_WINDOW: return nla_put_u32(skb, TIPC_NLA_PROP_WIN, ntohl(lc->value)); } return -EINVAL; } static int tipc_nl_compat_media_set(struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { struct nlattr *prop; struct nlattr *media; struct tipc_link_config *lc; lc = (struct tipc_link_config *)TLV_DATA(msg->req); media = nla_nest_start_noflag(skb, TIPC_NLA_MEDIA); if (!media) return -EMSGSIZE; if (nla_put_string(skb, TIPC_NLA_MEDIA_NAME, lc->name)) return -EMSGSIZE; prop = nla_nest_start_noflag(skb, TIPC_NLA_MEDIA_PROP); if (!prop) return -EMSGSIZE; __tipc_add_link_prop(skb, msg, lc); nla_nest_end(skb, prop); nla_nest_end(skb, media); return 0; } static int tipc_nl_compat_bearer_set(struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { struct nlattr *prop; struct nlattr *bearer; struct tipc_link_config *lc; lc = (struct tipc_link_config *)TLV_DATA(msg->req); bearer = nla_nest_start_noflag(skb, TIPC_NLA_BEARER); if (!bearer) return -EMSGSIZE; if (nla_put_string(skb, TIPC_NLA_BEARER_NAME, lc->name)) return -EMSGSIZE; prop = nla_nest_start_noflag(skb, TIPC_NLA_BEARER_PROP); if (!prop) return -EMSGSIZE; __tipc_add_link_prop(skb, msg, lc); nla_nest_end(skb, prop); nla_nest_end(skb, bearer); return 0; } static int __tipc_nl_compat_link_set(struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { struct nlattr *prop; struct nlattr *link; struct tipc_link_config *lc; lc = (struct tipc_link_config *)TLV_DATA(msg->req); link = nla_nest_start_noflag(skb, TIPC_NLA_LINK); if (!link) return -EMSGSIZE; if (nla_put_string(skb, TIPC_NLA_LINK_NAME, lc->name)) return -EMSGSIZE; prop = nla_nest_start_noflag(skb, TIPC_NLA_LINK_PROP); if (!prop) return -EMSGSIZE; __tipc_add_link_prop(skb, msg, lc); nla_nest_end(skb, prop); nla_nest_end(skb, link); return 0; } static int tipc_nl_compat_link_set(struct tipc_nl_compat_cmd_doit *cmd, struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { struct tipc_link_config *lc; struct tipc_bearer *bearer; struct tipc_media *media; int len; lc = (struct tipc_link_config *)TLV_DATA(msg->req); len = TLV_GET_DATA_LEN(msg->req); len -= offsetof(struct tipc_link_config, name); if (len <= 0) return -EINVAL; len = min_t(int, len, TIPC_MAX_LINK_NAME); if (!string_is_terminated(lc->name, len)) return -EINVAL; media = tipc_media_find(lc->name); if (media) { cmd->doit = &__tipc_nl_media_set; return tipc_nl_compat_media_set(skb, msg); } bearer = tipc_bearer_find(msg->net, lc->name); if (bearer) { cmd->doit = &__tipc_nl_bearer_set; return tipc_nl_compat_bearer_set(skb, msg); } return __tipc_nl_compat_link_set(skb, msg); } static int tipc_nl_compat_link_reset_stats(struct tipc_nl_compat_cmd_doit *cmd, struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { char *name; struct nlattr *link; int len; name = (char *)TLV_DATA(msg->req); link = nla_nest_start_noflag(skb, TIPC_NLA_LINK); if (!link) return -EMSGSIZE; len = TLV_GET_DATA_LEN(msg->req); if (len <= 0) return -EINVAL; len = min_t(int, len, TIPC_MAX_LINK_NAME); if (!string_is_terminated(name, len)) return -EINVAL; if (nla_put_string(skb, TIPC_NLA_LINK_NAME, name)) return -EMSGSIZE; nla_nest_end(skb, link); return 0; } static int tipc_nl_compat_name_table_dump_header(struct tipc_nl_compat_msg *msg) { int i; u32 depth; struct tipc_name_table_query *ntq; static const char * const header[] = { "Type ", "Lower Upper ", "Port Identity ", "Publication Scope" }; ntq = (struct tipc_name_table_query *)TLV_DATA(msg->req); if (TLV_GET_DATA_LEN(msg->req) < (int)sizeof(struct tipc_name_table_query)) return -EINVAL; depth = ntohl(ntq->depth); if (depth > 4) depth = 4; for (i = 0; i < depth; i++) tipc_tlv_sprintf(msg->rep, header[i]); tipc_tlv_sprintf(msg->rep, "\n"); return 0; } static int tipc_nl_compat_name_table_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { char port_str[27]; struct tipc_name_table_query *ntq; struct nlattr *nt[TIPC_NLA_NAME_TABLE_MAX + 1]; struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1]; u32 node, depth, type, lowbound, upbound; static const char * const scope_str[] = {"", " zone", " cluster", " node"}; int err; if (!attrs[TIPC_NLA_NAME_TABLE]) return -EINVAL; err = nla_parse_nested_deprecated(nt, TIPC_NLA_NAME_TABLE_MAX, attrs[TIPC_NLA_NAME_TABLE], NULL, NULL); if (err) return err; if (!nt[TIPC_NLA_NAME_TABLE_PUBL]) return -EINVAL; err = nla_parse_nested_deprecated(publ, TIPC_NLA_PUBL_MAX, nt[TIPC_NLA_NAME_TABLE_PUBL], NULL, NULL); if (err) return err; ntq = (struct tipc_name_table_query *)TLV_DATA(msg->req); depth = ntohl(ntq->depth); type = ntohl(ntq->type); lowbound = ntohl(ntq->lowbound); upbound = ntohl(ntq->upbound); if (!(depth & TIPC_NTQ_ALLTYPES) && (type != nla_get_u32(publ[TIPC_NLA_PUBL_TYPE]))) return 0; if (lowbound && (lowbound > nla_get_u32(publ[TIPC_NLA_PUBL_UPPER]))) return 0; if (upbound && (upbound < nla_get_u32(publ[TIPC_NLA_PUBL_LOWER]))) return 0; tipc_tlv_sprintf(msg->rep, "%-10u ", nla_get_u32(publ[TIPC_NLA_PUBL_TYPE])); if (depth == 1) goto out; tipc_tlv_sprintf(msg->rep, "%-10u %-10u ", nla_get_u32(publ[TIPC_NLA_PUBL_LOWER]), nla_get_u32(publ[TIPC_NLA_PUBL_UPPER])); if (depth == 2) goto out; node = nla_get_u32(publ[TIPC_NLA_PUBL_NODE]); sprintf(port_str, "<%u.%u.%u:%u>", tipc_zone(node), tipc_cluster(node), tipc_node(node), nla_get_u32(publ[TIPC_NLA_PUBL_REF])); tipc_tlv_sprintf(msg->rep, "%-26s ", port_str); if (depth == 3) goto out; tipc_tlv_sprintf(msg->rep, "%-10u %s", nla_get_u32(publ[TIPC_NLA_PUBL_KEY]), scope_str[nla_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]); out: tipc_tlv_sprintf(msg->rep, "\n"); return 0; } static int __tipc_nl_compat_publ_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { u32 type, lower, upper; struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1]; int err; if (!attrs[TIPC_NLA_PUBL]) return -EINVAL; err = nla_parse_nested_deprecated(publ, TIPC_NLA_PUBL_MAX, attrs[TIPC_NLA_PUBL], NULL, NULL); if (err) return err; type = nla_get_u32(publ[TIPC_NLA_PUBL_TYPE]); lower = nla_get_u32(publ[TIPC_NLA_PUBL_LOWER]); upper = nla_get_u32(publ[TIPC_NLA_PUBL_UPPER]); if (lower == upper) tipc_tlv_sprintf(msg->rep, " {%u,%u}", type, lower); else tipc_tlv_sprintf(msg->rep, " {%u,%u,%u}", type, lower, upper); return 0; } static int tipc_nl_compat_publ_dump(struct tipc_nl_compat_msg *msg, u32 sock) { int err; void *hdr; struct nlattr *nest; struct sk_buff *args; struct tipc_nl_compat_cmd_dump dump; args = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL); if (!args) return -ENOMEM; hdr = genlmsg_put(args, 0, 0, &tipc_genl_family, NLM_F_MULTI, TIPC_NL_PUBL_GET); if (!hdr) { kfree_skb(args); return -EMSGSIZE; } nest = nla_nest_start_noflag(args, TIPC_NLA_SOCK); if (!nest) { kfree_skb(args); return -EMSGSIZE; } if (nla_put_u32(args, TIPC_NLA_SOCK_REF, sock)) { kfree_skb(args); return -EMSGSIZE; } nla_nest_end(args, nest); genlmsg_end(args, hdr); dump.dumpit = tipc_nl_publ_dump; dump.format = __tipc_nl_compat_publ_dump; err = __tipc_nl_compat_dumpit(&dump, msg, args); kfree_skb(args); return err; } static int tipc_nl_compat_sk_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { int err; u32 sock_ref; struct nlattr *sock[TIPC_NLA_SOCK_MAX + 1]; if (!attrs[TIPC_NLA_SOCK]) return -EINVAL; err = nla_parse_nested_deprecated(sock, TIPC_NLA_SOCK_MAX, attrs[TIPC_NLA_SOCK], NULL, NULL); if (err) return err; sock_ref = nla_get_u32(sock[TIPC_NLA_SOCK_REF]); tipc_tlv_sprintf(msg->rep, "%u:", sock_ref); if (sock[TIPC_NLA_SOCK_CON]) { u32 node; struct nlattr *con[TIPC_NLA_CON_MAX + 1]; err = nla_parse_nested_deprecated(con, TIPC_NLA_CON_MAX, sock[TIPC_NLA_SOCK_CON], NULL, NULL); if (err) return err; node = nla_get_u32(con[TIPC_NLA_CON_NODE]); tipc_tlv_sprintf(msg->rep, " connected to <%u.%u.%u:%u>", tipc_zone(node), tipc_cluster(node), tipc_node(node), nla_get_u32(con[TIPC_NLA_CON_SOCK])); if (con[TIPC_NLA_CON_FLAG]) tipc_tlv_sprintf(msg->rep, " via {%u,%u}\n", nla_get_u32(con[TIPC_NLA_CON_TYPE]), nla_get_u32(con[TIPC_NLA_CON_INST])); else tipc_tlv_sprintf(msg->rep, "\n"); } else if (sock[TIPC_NLA_SOCK_HAS_PUBL]) { tipc_tlv_sprintf(msg->rep, " bound to"); err = tipc_nl_compat_publ_dump(msg, sock_ref); if (err) return err; } tipc_tlv_sprintf(msg->rep, "\n"); return 0; } static int tipc_nl_compat_media_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { struct nlattr *media[TIPC_NLA_MEDIA_MAX + 1]; int err; if (!attrs[TIPC_NLA_MEDIA]) return -EINVAL; err = nla_parse_nested_deprecated(media, TIPC_NLA_MEDIA_MAX, attrs[TIPC_NLA_MEDIA], NULL, NULL); if (err) return err; return tipc_add_tlv(msg->rep, TIPC_TLV_MEDIA_NAME, nla_data(media[TIPC_NLA_MEDIA_NAME]), nla_len(media[TIPC_NLA_MEDIA_NAME])); } static int tipc_nl_compat_node_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { struct tipc_node_info node_info; struct nlattr *node[TIPC_NLA_NODE_MAX + 1]; int err; if (!attrs[TIPC_NLA_NODE]) return -EINVAL; err = nla_parse_nested_deprecated(node, TIPC_NLA_NODE_MAX, attrs[TIPC_NLA_NODE], NULL, NULL); if (err) return err; node_info.addr = htonl(nla_get_u32(node[TIPC_NLA_NODE_ADDR])); node_info.up = htonl(nla_get_flag(node[TIPC_NLA_NODE_UP])); return tipc_add_tlv(msg->rep, TIPC_TLV_NODE_INFO, &node_info, sizeof(node_info)); } static int tipc_nl_compat_net_set(struct tipc_nl_compat_cmd_doit *cmd, struct sk_buff *skb, struct tipc_nl_compat_msg *msg) { u32 val; struct nlattr *net; val = ntohl(*(__be32 *)TLV_DATA(msg->req)); net = nla_nest_start_noflag(skb, TIPC_NLA_NET); if (!net) return -EMSGSIZE; if (msg->cmd == TIPC_CMD_SET_NODE_ADDR) { if (nla_put_u32(skb, TIPC_NLA_NET_ADDR, val)) return -EMSGSIZE; } else if (msg->cmd == TIPC_CMD_SET_NETID) { if (nla_put_u32(skb, TIPC_NLA_NET_ID, val)) return -EMSGSIZE; } nla_nest_end(skb, net); return 0; } static int tipc_nl_compat_net_dump(struct tipc_nl_compat_msg *msg, struct nlattr **attrs) { __be32 id; struct nlattr *net[TIPC_NLA_NET_MAX + 1]; int err; if (!attrs[TIPC_NLA_NET]) return -EINVAL; err = nla_parse_nested_deprecated(net, TIPC_NLA_NET_MAX, attrs[TIPC_NLA_NET], NULL, NULL); if (err) return err; id = htonl(nla_get_u32(net[TIPC_NLA_NET_ID])); return tipc_add_tlv(msg->rep, TIPC_TLV_UNSIGNED, &id, sizeof(id)); } static int tipc_cmd_show_stats_compat(struct tipc_nl_compat_msg *msg) { msg->rep = tipc_tlv_alloc(ULTRA_STRING_MAX_LEN); if (!msg->rep) return -ENOMEM; tipc_tlv_init(msg->rep, TIPC_TLV_ULTRA_STRING); tipc_tlv_sprintf(msg->rep, "TIPC version " TIPC_MOD_VER "\n"); return 0; } static int tipc_nl_compat_handle(struct tipc_nl_compat_msg *msg) { struct tipc_nl_compat_cmd_dump dump; struct tipc_nl_compat_cmd_doit doit; memset(&dump, 0, sizeof(dump)); memset(&doit, 0, sizeof(doit)); switch (msg->cmd) { case TIPC_CMD_NOOP: msg->rep = tipc_tlv_alloc(0); if (!msg->rep) return -ENOMEM; return 0; case TIPC_CMD_GET_BEARER_NAMES: msg->rep_size = MAX_BEARERS * TLV_SPACE(TIPC_MAX_BEARER_NAME); dump.dumpit = tipc_nl_bearer_dump; dump.format = tipc_nl_compat_bearer_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_ENABLE_BEARER: msg->req_type = TIPC_TLV_BEARER_CONFIG; doit.doit = __tipc_nl_bearer_enable; doit.transcode = tipc_nl_compat_bearer_enable; return tipc_nl_compat_doit(&doit, msg); case TIPC_CMD_DISABLE_BEARER: msg->req_type = TIPC_TLV_BEARER_NAME; doit.doit = __tipc_nl_bearer_disable; doit.transcode = tipc_nl_compat_bearer_disable; return tipc_nl_compat_doit(&doit, msg); case TIPC_CMD_SHOW_LINK_STATS: msg->req_type = TIPC_TLV_LINK_NAME; msg->rep_size = ULTRA_STRING_MAX_LEN; msg->rep_type = TIPC_TLV_ULTRA_STRING; dump.dumpit = tipc_nl_node_dump_link; dump.format = tipc_nl_compat_link_stat_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_GET_LINKS: msg->req_type = TIPC_TLV_NET_ADDR; msg->rep_size = ULTRA_STRING_MAX_LEN; dump.dumpit = tipc_nl_node_dump_link; dump.format = tipc_nl_compat_link_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_SET_LINK_TOL: case TIPC_CMD_SET_LINK_PRI: case TIPC_CMD_SET_LINK_WINDOW: msg->req_type = TIPC_TLV_LINK_CONFIG; doit.doit = tipc_nl_node_set_link; doit.transcode = tipc_nl_compat_link_set; return tipc_nl_compat_doit(&doit, msg); case TIPC_CMD_RESET_LINK_STATS: msg->req_type = TIPC_TLV_LINK_NAME; doit.doit = tipc_nl_node_reset_link_stats; doit.transcode = tipc_nl_compat_link_reset_stats; return tipc_nl_compat_doit(&doit, msg); case TIPC_CMD_SHOW_NAME_TABLE: msg->req_type = TIPC_TLV_NAME_TBL_QUERY; msg->rep_size = ULTRA_STRING_MAX_LEN; msg->rep_type = TIPC_TLV_ULTRA_STRING; dump.header = tipc_nl_compat_name_table_dump_header; dump.dumpit = tipc_nl_name_table_dump; dump.format = tipc_nl_compat_name_table_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_SHOW_PORTS: msg->rep_size = ULTRA_STRING_MAX_LEN; msg->rep_type = TIPC_TLV_ULTRA_STRING; dump.dumpit = tipc_nl_sk_dump; dump.format = tipc_nl_compat_sk_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_GET_MEDIA_NAMES: msg->rep_size = MAX_MEDIA * TLV_SPACE(TIPC_MAX_MEDIA_NAME); dump.dumpit = tipc_nl_media_dump; dump.format = tipc_nl_compat_media_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_GET_NODES: msg->rep_size = ULTRA_STRING_MAX_LEN; dump.dumpit = tipc_nl_node_dump; dump.format = tipc_nl_compat_node_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_SET_NODE_ADDR: msg->req_type = TIPC_TLV_NET_ADDR; doit.doit = __tipc_nl_net_set; doit.transcode = tipc_nl_compat_net_set; return tipc_nl_compat_doit(&doit, msg); case TIPC_CMD_SET_NETID: msg->req_type = TIPC_TLV_UNSIGNED; doit.doit = __tipc_nl_net_set; doit.transcode = tipc_nl_compat_net_set; return tipc_nl_compat_doit(&doit, msg); case TIPC_CMD_GET_NETID: msg->rep_size = sizeof(u32); dump.dumpit = tipc_nl_net_dump; dump.format = tipc_nl_compat_net_dump; return tipc_nl_compat_dumpit(&dump, msg); case TIPC_CMD_SHOW_STATS: return tipc_cmd_show_stats_compat(msg); } return -EOPNOTSUPP; } static int tipc_nl_compat_recv(struct sk_buff *skb, struct genl_info *info) { int err; int len; struct tipc_nl_compat_msg msg; struct nlmsghdr *req_nlh; struct nlmsghdr *rep_nlh; struct tipc_genlmsghdr *req_userhdr = genl_info_userhdr(info); memset(&msg, 0, sizeof(msg)); req_nlh = (struct nlmsghdr *)skb->data; msg.req = nlmsg_data(req_nlh) + GENL_HDRLEN + TIPC_GENL_HDRLEN; msg.cmd = req_userhdr->cmd; msg.net = genl_info_net(info); msg.dst_sk = skb->sk; if ((msg.cmd & 0xC000) && (!netlink_net_capable(skb, CAP_NET_ADMIN))) { msg.rep = tipc_get_err_tlv(TIPC_CFG_NOT_NET_ADMIN); err = -EACCES; goto send; } msg.req_size = nlmsg_attrlen(req_nlh, GENL_HDRLEN + TIPC_GENL_HDRLEN); if (msg.req_size && !TLV_OK(msg.req, msg.req_size)) { msg.rep = tipc_get_err_tlv(TIPC_CFG_NOT_SUPPORTED); err = -EOPNOTSUPP; goto send; } err = tipc_nl_compat_handle(&msg); if ((err == -EOPNOTSUPP) || (err == -EPERM)) msg.rep = tipc_get_err_tlv(TIPC_CFG_NOT_SUPPORTED); else if (err == -EINVAL) msg.rep = tipc_get_err_tlv(TIPC_CFG_TLV_ERROR); send: if (!msg.rep) return err; len = nlmsg_total_size(GENL_HDRLEN + TIPC_GENL_HDRLEN); skb_push(msg.rep, len); rep_nlh = nlmsg_hdr(msg.rep); memcpy(rep_nlh, info->nlhdr, len); rep_nlh->nlmsg_len = msg.rep->len; genlmsg_unicast(msg.net, msg.rep, NETLINK_CB(skb).portid); return err; } static const struct genl_small_ops tipc_genl_compat_ops[] = { { .cmd = TIPC_GENL_CMD, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = tipc_nl_compat_recv, }, }; static struct genl_family tipc_genl_compat_family __ro_after_init = { .name = TIPC_GENL_NAME, .version = TIPC_GENL_VERSION, .hdrsize = TIPC_GENL_HDRLEN, .maxattr = 0, .netnsok = true, .module = THIS_MODULE, .small_ops = tipc_genl_compat_ops, .n_small_ops = ARRAY_SIZE(tipc_genl_compat_ops), .resv_start_op = TIPC_GENL_CMD + 1, }; int __init tipc_netlink_compat_start(void) { int res; res = genl_register_family(&tipc_genl_compat_family); if (res) { pr_err("Failed to register legacy compat interface\n"); return res; } return 0; } void tipc_netlink_compat_stop(void) { genl_unregister_family(&tipc_genl_compat_family); } |
| 19661 68 6 19529 74 74 65 10 73 48 37 122 58 58 58 148 6 1 61 1 58 159 159 148 60 159 1 9 9 7 3 486 486 10 19527 19537 19520 19480 63 74 13 19527 4 19509 19525 19527 2 2 19524 19517 19534 264 256 9 9 265 19472 19469 2 2 19520 218 19520 264 19471 19513 9392 9393 9389 1 9389 5060 5061 5007 5006 1 1 4 1 4 4 4 1 3 4 191 188 19528 19522 72 72 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 | // SPDX-License-Identifier: GPL-2.0-or-later /* audit.c -- Auditing support * Gateway between the kernel (e.g., selinux) and the user-space audit daemon. * System-call specific features have moved to auditsc.c * * Copyright 2003-2007 Red Hat Inc., Durham, North Carolina. * All Rights Reserved. * * Written by Rickard E. (Rik) Faith <faith@redhat.com> * * Goals: 1) Integrate fully with Security Modules. * 2) Minimal run-time overhead: * a) Minimal when syscall auditing is disabled (audit_enable=0). * b) Small when syscall auditing is enabled and no audit record * is generated (defer as much work as possible to record * generation time): * i) context is allocated, * ii) names from getname are stored without a copy, and * iii) inode information stored from path_lookup. * 3) Ability to disable syscall auditing at boot time (audit=0). * 4) Usable by other parts of the kernel (if audit_log* is called, * then a syscall record will be generated automatically for the * current syscall). * 5) Netlink interface to user-space. * 6) Support low-overhead kernel-based filtering to minimize the * information that must be passed to user-space. * * Audit userspace, documentation, tests, and bug/issue trackers: * https://github.com/linux-audit */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/file.h> #include <linux/init.h> #include <linux/types.h> #include <linux/atomic.h> #include <linux/mm.h> #include <linux/export.h> #include <linux/slab.h> #include <linux/err.h> #include <linux/kthread.h> #include <linux/kernel.h> #include <linux/syscalls.h> #include <linux/spinlock.h> #include <linux/rcupdate.h> #include <linux/mutex.h> #include <linux/gfp.h> #include <linux/pid.h> #include <linux/audit.h> #include <net/sock.h> #include <net/netlink.h> #include <linux/skbuff.h> #include <linux/security.h> #include <linux/freezer.h> #include <linux/pid_namespace.h> #include <net/netns/generic.h> #include "audit.h" /* No auditing will take place until audit_initialized == AUDIT_INITIALIZED. * (Initialization happens after skb_init is called.) */ #define AUDIT_DISABLED -1 #define AUDIT_UNINITIALIZED 0 #define AUDIT_INITIALIZED 1 static int audit_initialized = AUDIT_UNINITIALIZED; u32 audit_enabled = AUDIT_OFF; bool audit_ever_enabled = !!AUDIT_OFF; EXPORT_SYMBOL_GPL(audit_enabled); /* Default state when kernel boots without any parameters. */ static u32 audit_default = AUDIT_OFF; /* If auditing cannot proceed, audit_failure selects what happens. */ static u32 audit_failure = AUDIT_FAIL_PRINTK; /* private audit network namespace index */ static unsigned int audit_net_id; /** * struct audit_net - audit private network namespace data * @sk: communication socket */ struct audit_net { struct sock *sk; }; /** * struct auditd_connection - kernel/auditd connection state * @pid: auditd PID * @portid: netlink portid * @net: the associated network namespace * @rcu: RCU head * * Description: * This struct is RCU protected; you must either hold the RCU lock for reading * or the associated spinlock for writing. */ struct auditd_connection { struct pid *pid; u32 portid; struct net *net; struct rcu_head rcu; }; static struct auditd_connection __rcu *auditd_conn; static DEFINE_SPINLOCK(auditd_conn_lock); /* If audit_rate_limit is non-zero, limit the rate of sending audit records * to that number per second. This prevents DoS attacks, but results in * audit records being dropped. */ static u32 audit_rate_limit; /* Number of outstanding audit_buffers allowed. * When set to zero, this means unlimited. */ static u32 audit_backlog_limit = 64; #define AUDIT_BACKLOG_WAIT_TIME (60 * HZ) static u32 audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME; /* The identity of the user shutting down the audit system. */ static kuid_t audit_sig_uid = INVALID_UID; static pid_t audit_sig_pid = -1; static struct lsm_prop audit_sig_lsm; /* Records can be lost in several ways: 0) [suppressed in audit_alloc] 1) out of memory in audit_log_start [kmalloc of struct audit_buffer] 2) out of memory in audit_log_move [alloc_skb] 3) suppressed due to audit_rate_limit 4) suppressed due to audit_backlog_limit */ static atomic_t audit_lost = ATOMIC_INIT(0); /* Monotonically increasing sum of time the kernel has spent * waiting while the backlog limit is exceeded. */ static atomic_t audit_backlog_wait_time_actual = ATOMIC_INIT(0); /* Hash for inode-based rules */ struct list_head audit_inode_hash[AUDIT_INODE_BUCKETS]; static struct kmem_cache *audit_buffer_cache; /* queue msgs to send via kauditd_task */ static struct sk_buff_head audit_queue; /* queue msgs due to temporary unicast send problems */ static struct sk_buff_head audit_retry_queue; /* queue msgs waiting for new auditd connection */ static struct sk_buff_head audit_hold_queue; /* queue servicing thread */ static struct task_struct *kauditd_task; static DECLARE_WAIT_QUEUE_HEAD(kauditd_wait); /* waitqueue for callers who are blocked on the audit backlog */ static DECLARE_WAIT_QUEUE_HEAD(audit_backlog_wait); static struct audit_features af = {.vers = AUDIT_FEATURE_VERSION, .mask = -1, .features = 0, .lock = 0,}; static char *audit_feature_names[2] = { "only_unset_loginuid", "loginuid_immutable", }; /** * struct audit_ctl_mutex - serialize requests from userspace * @lock: the mutex used for locking * @owner: the task which owns the lock * * Description: * This is the lock struct used to ensure we only process userspace requests * in an orderly fashion. We can't simply use a mutex/lock here because we * need to track lock ownership so we don't end up blocking the lock owner in * audit_log_start() or similar. */ static struct audit_ctl_mutex { struct mutex lock; void *owner; } audit_cmd_mutex; /* AUDIT_BUFSIZ is the size of the temporary buffer used for formatting * audit records. Since printk uses a 1024 byte buffer, this buffer * should be at least that large. */ #define AUDIT_BUFSIZ 1024 /* The audit_buffer is used when formatting an audit record. The caller * locks briefly to get the record off the freelist or to allocate the * buffer, and locks briefly to send the buffer to the netlink layer or * to place it on a transmit queue. Multiple audit_buffers can be in * use simultaneously. */ struct audit_buffer { struct sk_buff *skb; /* formatted skb ready to send */ struct audit_context *ctx; /* NULL or associated context */ gfp_t gfp_mask; }; struct audit_reply { __u32 portid; struct net *net; struct sk_buff *skb; }; /** * auditd_test_task - Check to see if a given task is an audit daemon * @task: the task to check * * Description: * Return 1 if the task is a registered audit daemon, 0 otherwise. */ int auditd_test_task(struct task_struct *task) { int rc; struct auditd_connection *ac; rcu_read_lock(); ac = rcu_dereference(auditd_conn); rc = (ac && ac->pid == task_tgid(task) ? 1 : 0); rcu_read_unlock(); return rc; } /** * audit_ctl_lock - Take the audit control lock */ void audit_ctl_lock(void) { mutex_lock(&audit_cmd_mutex.lock); audit_cmd_mutex.owner = current; } /** * audit_ctl_unlock - Drop the audit control lock */ void audit_ctl_unlock(void) { audit_cmd_mutex.owner = NULL; mutex_unlock(&audit_cmd_mutex.lock); } /** * audit_ctl_owner_current - Test to see if the current task owns the lock * * Description: * Return true if the current task owns the audit control lock, false if it * doesn't own the lock. */ static bool audit_ctl_owner_current(void) { return (current == audit_cmd_mutex.owner); } /** * auditd_pid_vnr - Return the auditd PID relative to the namespace * * Description: * Returns the PID in relation to the namespace, 0 on failure. */ static pid_t auditd_pid_vnr(void) { pid_t pid; const struct auditd_connection *ac; rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac || !ac->pid) pid = 0; else pid = pid_vnr(ac->pid); rcu_read_unlock(); return pid; } /** * audit_get_sk - Return the audit socket for the given network namespace * @net: the destination network namespace * * Description: * Returns the sock pointer if valid, NULL otherwise. The caller must ensure * that a reference is held for the network namespace while the sock is in use. */ static struct sock *audit_get_sk(const struct net *net) { struct audit_net *aunet; if (!net) return NULL; aunet = net_generic(net, audit_net_id); return aunet->sk; } void audit_panic(const char *message) { switch (audit_failure) { case AUDIT_FAIL_SILENT: break; case AUDIT_FAIL_PRINTK: if (printk_ratelimit()) pr_err("%s\n", message); break; case AUDIT_FAIL_PANIC: panic("audit: %s\n", message); break; } } static inline int audit_rate_check(void) { static unsigned long last_check = 0; static int messages = 0; static DEFINE_SPINLOCK(lock); unsigned long flags; unsigned long now; int retval = 0; if (!audit_rate_limit) return 1; spin_lock_irqsave(&lock, flags); if (++messages < audit_rate_limit) { retval = 1; } else { now = jiffies; if (time_after(now, last_check + HZ)) { last_check = now; messages = 0; retval = 1; } } spin_unlock_irqrestore(&lock, flags); return retval; } /** * audit_log_lost - conditionally log lost audit message event * @message: the message stating reason for lost audit message * * Emit at least 1 message per second, even if audit_rate_check is * throttling. * Always increment the lost messages counter. */ void audit_log_lost(const char *message) { static unsigned long last_msg = 0; static DEFINE_SPINLOCK(lock); unsigned long flags; unsigned long now; int print; atomic_inc(&audit_lost); print = (audit_failure == AUDIT_FAIL_PANIC || !audit_rate_limit); if (!print) { spin_lock_irqsave(&lock, flags); now = jiffies; if (time_after(now, last_msg + HZ)) { print = 1; last_msg = now; } spin_unlock_irqrestore(&lock, flags); } if (print) { if (printk_ratelimit()) pr_warn("audit_lost=%u audit_rate_limit=%u audit_backlog_limit=%u\n", atomic_read(&audit_lost), audit_rate_limit, audit_backlog_limit); audit_panic(message); } } static int audit_log_config_change(char *function_name, u32 new, u32 old, int allow_changes) { struct audit_buffer *ab; int rc = 0; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return rc; audit_log_format(ab, "op=set %s=%u old=%u ", function_name, new, old); audit_log_session_info(ab); rc = audit_log_task_context(ab); if (rc) allow_changes = 0; /* Something weird, deny request */ audit_log_format(ab, " res=%d", allow_changes); audit_log_end(ab); return rc; } static int audit_do_config_change(char *function_name, u32 *to_change, u32 new) { int allow_changes, rc = 0; u32 old = *to_change; /* check if we are locked */ if (audit_enabled == AUDIT_LOCKED) allow_changes = 0; else allow_changes = 1; if (audit_enabled != AUDIT_OFF) { rc = audit_log_config_change(function_name, new, old, allow_changes); if (rc) allow_changes = 0; } /* If we are allowed, make the change */ if (allow_changes == 1) *to_change = new; /* Not allowed, update reason */ else if (rc == 0) rc = -EPERM; return rc; } static int audit_set_rate_limit(u32 limit) { return audit_do_config_change("audit_rate_limit", &audit_rate_limit, limit); } static int audit_set_backlog_limit(u32 limit) { return audit_do_config_change("audit_backlog_limit", &audit_backlog_limit, limit); } static int audit_set_backlog_wait_time(u32 timeout) { return audit_do_config_change("audit_backlog_wait_time", &audit_backlog_wait_time, timeout); } static int audit_set_enabled(u32 state) { int rc; if (state > AUDIT_LOCKED) return -EINVAL; rc = audit_do_config_change("audit_enabled", &audit_enabled, state); if (!rc) audit_ever_enabled |= !!state; return rc; } static int audit_set_failure(u32 state) { if (state != AUDIT_FAIL_SILENT && state != AUDIT_FAIL_PRINTK && state != AUDIT_FAIL_PANIC) return -EINVAL; return audit_do_config_change("audit_failure", &audit_failure, state); } /** * auditd_conn_free - RCU helper to release an auditd connection struct * @rcu: RCU head * * Description: * Drop any references inside the auditd connection tracking struct and free * the memory. */ static void auditd_conn_free(struct rcu_head *rcu) { struct auditd_connection *ac; ac = container_of(rcu, struct auditd_connection, rcu); put_pid(ac->pid); put_net(ac->net); kfree(ac); } /** * auditd_set - Set/Reset the auditd connection state * @pid: auditd PID * @portid: auditd netlink portid * @net: auditd network namespace pointer * @skb: the netlink command from the audit daemon * @ack: netlink ack flag, cleared if ack'd here * * Description: * This function will obtain and drop network namespace references as * necessary. Returns zero on success, negative values on failure. */ static int auditd_set(struct pid *pid, u32 portid, struct net *net, struct sk_buff *skb, bool *ack) { unsigned long flags; struct auditd_connection *ac_old, *ac_new; struct nlmsghdr *nlh; if (!pid || !net) return -EINVAL; ac_new = kzalloc(sizeof(*ac_new), GFP_KERNEL); if (!ac_new) return -ENOMEM; ac_new->pid = get_pid(pid); ac_new->portid = portid; ac_new->net = get_net(net); /* send the ack now to avoid a race with the queue backlog */ if (*ack) { nlh = nlmsg_hdr(skb); netlink_ack(skb, nlh, 0, NULL); *ack = false; } spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); rcu_assign_pointer(auditd_conn, ac_new); spin_unlock_irqrestore(&auditd_conn_lock, flags); if (ac_old) call_rcu(&ac_old->rcu, auditd_conn_free); return 0; } /** * kauditd_printk_skb - Print the audit record to the ring buffer * @skb: audit record * * Whatever the reason, this packet may not make it to the auditd connection * so write it via printk so the information isn't completely lost. */ static void kauditd_printk_skb(struct sk_buff *skb) { struct nlmsghdr *nlh = nlmsg_hdr(skb); char *data = nlmsg_data(nlh); if (nlh->nlmsg_type != AUDIT_EOE && printk_ratelimit()) pr_notice("type=%d %s\n", nlh->nlmsg_type, data); } /** * kauditd_rehold_skb - Handle a audit record send failure in the hold queue * @skb: audit record * @error: error code (unused) * * Description: * This should only be used by the kauditd_thread when it fails to flush the * hold queue. */ static void kauditd_rehold_skb(struct sk_buff *skb, __always_unused int error) { /* put the record back in the queue */ skb_queue_tail(&audit_hold_queue, skb); } /** * kauditd_hold_skb - Queue an audit record, waiting for auditd * @skb: audit record * @error: error code * * Description: * Queue the audit record, waiting for an instance of auditd. When this * function is called we haven't given up yet on sending the record, but things * are not looking good. The first thing we want to do is try to write the * record via printk and then see if we want to try and hold on to the record * and queue it, if we have room. If we want to hold on to the record, but we * don't have room, record a record lost message. */ static void kauditd_hold_skb(struct sk_buff *skb, int error) { /* at this point it is uncertain if we will ever send this to auditd so * try to send the message via printk before we go any further */ kauditd_printk_skb(skb); /* can we just silently drop the message? */ if (!audit_default) goto drop; /* the hold queue is only for when the daemon goes away completely, * not -EAGAIN failures; if we are in a -EAGAIN state requeue the * record on the retry queue unless it's full, in which case drop it */ if (error == -EAGAIN) { if (!audit_backlog_limit || skb_queue_len(&audit_retry_queue) < audit_backlog_limit) { skb_queue_tail(&audit_retry_queue, skb); return; } audit_log_lost("kauditd retry queue overflow"); goto drop; } /* if we have room in the hold queue, queue the message */ if (!audit_backlog_limit || skb_queue_len(&audit_hold_queue) < audit_backlog_limit) { skb_queue_tail(&audit_hold_queue, skb); return; } /* we have no other options - drop the message */ audit_log_lost("kauditd hold queue overflow"); drop: kfree_skb(skb); } /** * kauditd_retry_skb - Queue an audit record, attempt to send again to auditd * @skb: audit record * @error: error code (unused) * * Description: * Not as serious as kauditd_hold_skb() as we still have a connected auditd, * but for some reason we are having problems sending it audit records so * queue the given record and attempt to resend. */ static void kauditd_retry_skb(struct sk_buff *skb, __always_unused int error) { if (!audit_backlog_limit || skb_queue_len(&audit_retry_queue) < audit_backlog_limit) { skb_queue_tail(&audit_retry_queue, skb); return; } /* we have to drop the record, send it via printk as a last effort */ kauditd_printk_skb(skb); audit_log_lost("kauditd retry queue overflow"); kfree_skb(skb); } /** * auditd_reset - Disconnect the auditd connection * @ac: auditd connection state * * Description: * Break the auditd/kauditd connection and move all the queued records into the * hold queue in case auditd reconnects. It is important to note that the @ac * pointer should never be dereferenced inside this function as it may be NULL * or invalid, you can only compare the memory address! If @ac is NULL then * the connection will always be reset. */ static void auditd_reset(const struct auditd_connection *ac) { unsigned long flags; struct sk_buff *skb; struct auditd_connection *ac_old; /* if it isn't already broken, break the connection */ spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); if (ac && ac != ac_old) { /* someone already registered a new auditd connection */ spin_unlock_irqrestore(&auditd_conn_lock, flags); return; } rcu_assign_pointer(auditd_conn, NULL); spin_unlock_irqrestore(&auditd_conn_lock, flags); if (ac_old) call_rcu(&ac_old->rcu, auditd_conn_free); /* flush the retry queue to the hold queue, but don't touch the main * queue since we need to process that normally for multicast */ while ((skb = skb_dequeue(&audit_retry_queue))) kauditd_hold_skb(skb, -ECONNREFUSED); } /** * auditd_send_unicast_skb - Send a record via unicast to auditd * @skb: audit record * * Description: * Send a skb to the audit daemon, returns positive/zero values on success and * negative values on failure; in all cases the skb will be consumed by this * function. If the send results in -ECONNREFUSED the connection with auditd * will be reset. This function may sleep so callers should not hold any locks * where this would cause a problem. */ static int auditd_send_unicast_skb(struct sk_buff *skb) { int rc; u32 portid; struct net *net; struct sock *sk; struct auditd_connection *ac; /* NOTE: we can't call netlink_unicast while in the RCU section so * take a reference to the network namespace and grab local * copies of the namespace, the sock, and the portid; the * namespace and sock aren't going to go away while we hold a * reference and if the portid does become invalid after the RCU * section netlink_unicast() should safely return an error */ rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac) { rcu_read_unlock(); kfree_skb(skb); rc = -ECONNREFUSED; goto err; } net = get_net(ac->net); sk = audit_get_sk(net); portid = ac->portid; rcu_read_unlock(); rc = netlink_unicast(sk, skb, portid, 0); put_net(net); if (rc < 0) goto err; return rc; err: if (ac && rc == -ECONNREFUSED) auditd_reset(ac); return rc; } /** * kauditd_send_queue - Helper for kauditd_thread to flush skb queues * @sk: the sending sock * @portid: the netlink destination * @queue: the skb queue to process * @retry_limit: limit on number of netlink unicast failures * @skb_hook: per-skb hook for additional processing * @err_hook: hook called if the skb fails the netlink unicast send * * Description: * Run through the given queue and attempt to send the audit records to auditd, * returns zero on success, negative values on failure. It is up to the caller * to ensure that the @sk is valid for the duration of this function. * */ static int kauditd_send_queue(struct sock *sk, u32 portid, struct sk_buff_head *queue, unsigned int retry_limit, void (*skb_hook)(struct sk_buff *skb), void (*err_hook)(struct sk_buff *skb, int error)) { int rc = 0; struct sk_buff *skb = NULL; struct sk_buff *skb_tail; unsigned int failed = 0; /* NOTE: kauditd_thread takes care of all our locking, we just use * the netlink info passed to us (e.g. sk and portid) */ skb_tail = skb_peek_tail(queue); while ((skb != skb_tail) && (skb = skb_dequeue(queue))) { /* call the skb_hook for each skb we touch */ if (skb_hook) (*skb_hook)(skb); /* can we send to anyone via unicast? */ if (!sk) { if (err_hook) (*err_hook)(skb, -ECONNREFUSED); continue; } retry: /* grab an extra skb reference in case of error */ skb_get(skb); rc = netlink_unicast(sk, skb, portid, 0); if (rc < 0) { /* send failed - try a few times unless fatal error */ if (++failed >= retry_limit || rc == -ECONNREFUSED || rc == -EPERM) { sk = NULL; if (err_hook) (*err_hook)(skb, rc); if (rc == -EAGAIN) rc = 0; /* continue to drain the queue */ continue; } else goto retry; } else { /* skb sent - drop the extra reference and continue */ consume_skb(skb); failed = 0; } } return (rc >= 0 ? 0 : rc); } /* * kauditd_send_multicast_skb - Send a record to any multicast listeners * @skb: audit record * * Description: * Write a multicast message to anyone listening in the initial network * namespace. This function doesn't consume an skb as might be expected since * it has to copy it anyways. */ static void kauditd_send_multicast_skb(struct sk_buff *skb) { struct sk_buff *copy; struct sock *sock = audit_get_sk(&init_net); struct nlmsghdr *nlh; /* NOTE: we are not taking an additional reference for init_net since * we don't have to worry about it going away */ if (!netlink_has_listeners(sock, AUDIT_NLGRP_READLOG)) return; /* * The seemingly wasteful skb_copy() rather than bumping the refcount * using skb_get() is necessary because non-standard mods are made to * the skb by the original kaudit unicast socket send routine. The * existing auditd daemon assumes this breakage. Fixing this would * require co-ordinating a change in the established protocol between * the kaudit kernel subsystem and the auditd userspace code. There is * no reason for new multicast clients to continue with this * non-compliance. */ copy = skb_copy(skb, GFP_KERNEL); if (!copy) return; nlh = nlmsg_hdr(copy); nlh->nlmsg_len = skb->len; nlmsg_multicast(sock, copy, 0, AUDIT_NLGRP_READLOG, GFP_KERNEL); } /** * kauditd_thread - Worker thread to send audit records to userspace * @dummy: unused */ static int kauditd_thread(void *dummy) { int rc; u32 portid = 0; struct net *net = NULL; struct sock *sk = NULL; struct auditd_connection *ac; #define UNICAST_RETRIES 5 set_freezable(); while (!kthread_should_stop()) { /* NOTE: see the lock comments in auditd_send_unicast_skb() */ rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac) { rcu_read_unlock(); goto main_queue; } net = get_net(ac->net); sk = audit_get_sk(net); portid = ac->portid; rcu_read_unlock(); /* attempt to flush the hold queue */ rc = kauditd_send_queue(sk, portid, &audit_hold_queue, UNICAST_RETRIES, NULL, kauditd_rehold_skb); if (rc < 0) { sk = NULL; auditd_reset(ac); goto main_queue; } /* attempt to flush the retry queue */ rc = kauditd_send_queue(sk, portid, &audit_retry_queue, UNICAST_RETRIES, NULL, kauditd_hold_skb); if (rc < 0) { sk = NULL; auditd_reset(ac); goto main_queue; } main_queue: /* process the main queue - do the multicast send and attempt * unicast, dump failed record sends to the retry queue; if * sk == NULL due to previous failures we will just do the * multicast send and move the record to the hold queue */ rc = kauditd_send_queue(sk, portid, &audit_queue, 1, kauditd_send_multicast_skb, (sk ? kauditd_retry_skb : kauditd_hold_skb)); if (ac && rc < 0) auditd_reset(ac); sk = NULL; /* drop our netns reference, no auditd sends past this line */ if (net) { put_net(net); net = NULL; } /* we have processed all the queues so wake everyone */ wake_up(&audit_backlog_wait); /* NOTE: we want to wake up if there is anything on the queue, * regardless of if an auditd is connected, as we need to * do the multicast send and rotate records from the * main queue to the retry/hold queues */ wait_event_freezable(kauditd_wait, (skb_queue_len(&audit_queue) ? 1 : 0)); } return 0; } int audit_send_list_thread(void *_dest) { struct audit_netlink_list *dest = _dest; struct sk_buff *skb; struct sock *sk = audit_get_sk(dest->net); /* wait for parent to finish and send an ACK */ audit_ctl_lock(); audit_ctl_unlock(); while ((skb = __skb_dequeue(&dest->q)) != NULL) netlink_unicast(sk, skb, dest->portid, 0); put_net(dest->net); kfree(dest); return 0; } struct sk_buff *audit_make_reply(int seq, int type, int done, int multi, const void *payload, int size) { struct sk_buff *skb; struct nlmsghdr *nlh; void *data; int flags = multi ? NLM_F_MULTI : 0; int t = done ? NLMSG_DONE : type; skb = nlmsg_new(size, GFP_KERNEL); if (!skb) return NULL; nlh = nlmsg_put(skb, 0, seq, t, size, flags); if (!nlh) goto out_kfree_skb; data = nlmsg_data(nlh); memcpy(data, payload, size); return skb; out_kfree_skb: kfree_skb(skb); return NULL; } static void audit_free_reply(struct audit_reply *reply) { if (!reply) return; kfree_skb(reply->skb); if (reply->net) put_net(reply->net); kfree(reply); } static int audit_send_reply_thread(void *arg) { struct audit_reply *reply = (struct audit_reply *)arg; audit_ctl_lock(); audit_ctl_unlock(); /* Ignore failure. It'll only happen if the sender goes away, because our timeout is set to infinite. */ netlink_unicast(audit_get_sk(reply->net), reply->skb, reply->portid, 0); reply->skb = NULL; audit_free_reply(reply); return 0; } /** * audit_send_reply - send an audit reply message via netlink * @request_skb: skb of request we are replying to (used to target the reply) * @seq: sequence number * @type: audit message type * @done: done (last) flag * @multi: multi-part message flag * @payload: payload data * @size: payload size * * Allocates a skb, builds the netlink message, and sends it to the port id. */ static void audit_send_reply(struct sk_buff *request_skb, int seq, int type, int done, int multi, const void *payload, int size) { struct task_struct *tsk; struct audit_reply *reply; reply = kzalloc(sizeof(*reply), GFP_KERNEL); if (!reply) return; reply->skb = audit_make_reply(seq, type, done, multi, payload, size); if (!reply->skb) goto err; reply->net = get_net(sock_net(NETLINK_CB(request_skb).sk)); reply->portid = NETLINK_CB(request_skb).portid; tsk = kthread_run(audit_send_reply_thread, reply, "audit_send_reply"); if (IS_ERR(tsk)) goto err; return; err: audit_free_reply(reply); } /* * Check for appropriate CAP_AUDIT_ capabilities on incoming audit * control messages. */ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type) { int err = 0; /* Only support initial user namespace for now. */ /* * We return ECONNREFUSED because it tricks userspace into thinking * that audit was not configured into the kernel. Lots of users * configure their PAM stack (because that's what the distro does) * to reject login if unable to send messages to audit. If we return * ECONNREFUSED the PAM stack thinks the kernel does not have audit * configured in and will let login proceed. If we return EPERM * userspace will reject all logins. This should be removed when we * support non init namespaces!! */ if (current_user_ns() != &init_user_ns) return -ECONNREFUSED; switch (msg_type) { case AUDIT_LIST: case AUDIT_ADD: case AUDIT_DEL: return -EOPNOTSUPP; case AUDIT_GET: case AUDIT_SET: case AUDIT_GET_FEATURE: case AUDIT_SET_FEATURE: case AUDIT_LIST_RULES: case AUDIT_ADD_RULE: case AUDIT_DEL_RULE: case AUDIT_SIGNAL_INFO: case AUDIT_TTY_GET: case AUDIT_TTY_SET: case AUDIT_TRIM: case AUDIT_MAKE_EQUIV: /* Only support auditd and auditctl in initial pid namespace * for now. */ if (task_active_pid_ns(current) != &init_pid_ns) return -EPERM; if (!netlink_capable(skb, CAP_AUDIT_CONTROL)) err = -EPERM; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!netlink_capable(skb, CAP_AUDIT_WRITE)) err = -EPERM; break; default: /* bad msg */ err = -EINVAL; } return err; } static void audit_log_common_recv_msg(struct audit_context *context, struct audit_buffer **ab, u16 msg_type) { uid_t uid = from_kuid(&init_user_ns, current_uid()); pid_t pid = task_tgid_nr(current); if (!audit_enabled && msg_type != AUDIT_USER_AVC) { *ab = NULL; return; } *ab = audit_log_start(context, GFP_KERNEL, msg_type); if (unlikely(!*ab)) return; audit_log_format(*ab, "pid=%d uid=%u ", pid, uid); audit_log_session_info(*ab); audit_log_task_context(*ab); } static inline void audit_log_user_recv_msg(struct audit_buffer **ab, u16 msg_type) { audit_log_common_recv_msg(NULL, ab, msg_type); } static int is_audit_feature_set(int i) { return af.features & AUDIT_FEATURE_TO_MASK(i); } static int audit_get_feature(struct sk_buff *skb) { u32 seq; seq = nlmsg_hdr(skb)->nlmsg_seq; audit_send_reply(skb, seq, AUDIT_GET_FEATURE, 0, 0, &af, sizeof(af)); return 0; } static void audit_log_feature_change(int which, u32 old_feature, u32 new_feature, u32 old_lock, u32 new_lock, int res) { struct audit_buffer *ab; if (audit_enabled == AUDIT_OFF) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_FEATURE_CHANGE); if (!ab) return; audit_log_task_info(ab); audit_log_format(ab, " feature=%s old=%u new=%u old_lock=%u new_lock=%u res=%d", audit_feature_names[which], !!old_feature, !!new_feature, !!old_lock, !!new_lock, res); audit_log_end(ab); } static int audit_set_feature(struct audit_features *uaf) { int i; BUILD_BUG_ON(AUDIT_LAST_FEATURE + 1 > ARRAY_SIZE(audit_feature_names)); /* if there is ever a version 2 we should handle that here */ for (i = 0; i <= AUDIT_LAST_FEATURE; i++) { u32 feature = AUDIT_FEATURE_TO_MASK(i); u32 old_feature, new_feature, old_lock, new_lock; /* if we are not changing this feature, move along */ if (!(feature & uaf->mask)) continue; old_feature = af.features & feature; new_feature = uaf->features & feature; new_lock = (uaf->lock | af.lock) & feature; old_lock = af.lock & feature; /* are we changing a locked feature? */ if (old_lock && (new_feature != old_feature)) { audit_log_feature_change(i, old_feature, new_feature, old_lock, new_lock, 0); return -EPERM; } } /* nothing invalid, do the changes */ for (i = 0; i <= AUDIT_LAST_FEATURE; i++) { u32 feature = AUDIT_FEATURE_TO_MASK(i); u32 old_feature, new_feature, old_lock, new_lock; /* if we are not changing this feature, move along */ if (!(feature & uaf->mask)) continue; old_feature = af.features & feature; new_feature = uaf->features & feature; old_lock = af.lock & feature; new_lock = (uaf->lock | af.lock) & feature; if (new_feature != old_feature) audit_log_feature_change(i, old_feature, new_feature, old_lock, new_lock, 1); if (new_feature) af.features |= feature; else af.features &= ~feature; af.lock |= new_lock; } return 0; } static int audit_replace(struct pid *pid) { pid_t pvnr; struct sk_buff *skb; pvnr = pid_vnr(pid); skb = audit_make_reply(0, AUDIT_REPLACE, 0, 0, &pvnr, sizeof(pvnr)); if (!skb) return -ENOMEM; return auditd_send_unicast_skb(skb); } static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh, bool *ack) { u32 seq; void *data; int data_len; int err; struct audit_buffer *ab; u16 msg_type = nlh->nlmsg_type; struct audit_sig_info *sig_data; struct lsm_context lsmctx = { NULL, 0, 0 }; err = audit_netlink_ok(skb, msg_type); if (err) return err; seq = nlh->nlmsg_seq; data = nlmsg_data(nlh); data_len = nlmsg_len(nlh); switch (msg_type) { case AUDIT_GET: { struct audit_status s; memset(&s, 0, sizeof(s)); s.enabled = audit_enabled; s.failure = audit_failure; /* NOTE: use pid_vnr() so the PID is relative to the current * namespace */ s.pid = auditd_pid_vnr(); s.rate_limit = audit_rate_limit; s.backlog_limit = audit_backlog_limit; s.lost = atomic_read(&audit_lost); s.backlog = skb_queue_len(&audit_queue); s.feature_bitmap = AUDIT_FEATURE_BITMAP_ALL; s.backlog_wait_time = audit_backlog_wait_time; s.backlog_wait_time_actual = atomic_read(&audit_backlog_wait_time_actual); audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s)); break; } case AUDIT_SET: { struct audit_status s; memset(&s, 0, sizeof(s)); /* guard against past and future API changes */ memcpy(&s, data, min_t(size_t, sizeof(s), data_len)); if (s.mask & AUDIT_STATUS_ENABLED) { err = audit_set_enabled(s.enabled); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_FAILURE) { err = audit_set_failure(s.failure); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_PID) { /* NOTE: we are using the vnr PID functions below * because the s.pid value is relative to the * namespace of the caller; at present this * doesn't matter much since you can really only * run auditd from the initial pid namespace, but * something to keep in mind if this changes */ pid_t new_pid = s.pid; pid_t auditd_pid; struct pid *req_pid = task_tgid(current); /* Sanity check - PID values must match. Setting * pid to 0 is how auditd ends auditing. */ if (new_pid && (new_pid != pid_vnr(req_pid))) return -EINVAL; /* test the auditd connection */ audit_replace(req_pid); auditd_pid = auditd_pid_vnr(); if (auditd_pid) { /* replacing a healthy auditd is not allowed */ if (new_pid) { audit_log_config_change("audit_pid", new_pid, auditd_pid, 0); return -EEXIST; } /* only current auditd can unregister itself */ if (pid_vnr(req_pid) != auditd_pid) { audit_log_config_change("audit_pid", new_pid, auditd_pid, 0); return -EACCES; } } if (new_pid) { /* register a new auditd connection */ err = auditd_set(req_pid, NETLINK_CB(skb).portid, sock_net(NETLINK_CB(skb).sk), skb, ack); if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, auditd_pid, err ? 0 : 1); if (err) return err; /* try to process any backlog */ wake_up_interruptible(&kauditd_wait); } else { if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, auditd_pid, 1); /* unregister the auditd connection */ auditd_reset(NULL); } } if (s.mask & AUDIT_STATUS_RATE_LIMIT) { err = audit_set_rate_limit(s.rate_limit); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_BACKLOG_LIMIT) { err = audit_set_backlog_limit(s.backlog_limit); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_BACKLOG_WAIT_TIME) { if (sizeof(s) > (size_t)nlh->nlmsg_len) return -EINVAL; if (s.backlog_wait_time > 10*AUDIT_BACKLOG_WAIT_TIME) return -EINVAL; err = audit_set_backlog_wait_time(s.backlog_wait_time); if (err < 0) return err; } if (s.mask == AUDIT_STATUS_LOST) { u32 lost = atomic_xchg(&audit_lost, 0); audit_log_config_change("lost", 0, lost, 1); return lost; } if (s.mask == AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL) { u32 actual = atomic_xchg(&audit_backlog_wait_time_actual, 0); audit_log_config_change("backlog_wait_time_actual", 0, actual, 1); return actual; } break; } case AUDIT_GET_FEATURE: err = audit_get_feature(skb); if (err) return err; break; case AUDIT_SET_FEATURE: if (data_len < sizeof(struct audit_features)) return -EINVAL; err = audit_set_feature(data); if (err) return err; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!audit_enabled && msg_type != AUDIT_USER_AVC) return 0; /* exit early if there isn't at least one character to print */ if (data_len < 2) return -EINVAL; err = audit_filter(msg_type, AUDIT_FILTER_USER); if (err == 1) { /* match or error */ char *str = data; err = 0; if (msg_type == AUDIT_USER_TTY) { err = tty_audit_push(); if (err) break; } audit_log_user_recv_msg(&ab, msg_type); if (msg_type != AUDIT_USER_TTY) { /* ensure NULL termination */ str[data_len - 1] = '\0'; audit_log_format(ab, " msg='%.*s'", AUDIT_MESSAGE_TEXT_MAX, str); } else { audit_log_format(ab, " data="); if (str[data_len - 1] == '\0') data_len--; audit_log_n_untrustedstring(ab, str, data_len); } audit_log_end(ab); } break; case AUDIT_ADD_RULE: case AUDIT_DEL_RULE: if (data_len < sizeof(struct audit_rule_data)) return -EINVAL; if (audit_enabled == AUDIT_LOCKED) { audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=%s audit_enabled=%d res=0", msg_type == AUDIT_ADD_RULE ? "add_rule" : "remove_rule", audit_enabled); audit_log_end(ab); return -EPERM; } err = audit_rule_change(msg_type, seq, data, data_len); break; case AUDIT_LIST_RULES: err = audit_list_rules_send(skb, seq); break; case AUDIT_TRIM: audit_trim_trees(); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=trim res=1"); audit_log_end(ab); break; case AUDIT_MAKE_EQUIV: { void *bufp = data; u32 sizes[2]; size_t msglen = data_len; char *old, *new; err = -EINVAL; if (msglen < 2 * sizeof(u32)) break; memcpy(sizes, bufp, 2 * sizeof(u32)); bufp += 2 * sizeof(u32); msglen -= 2 * sizeof(u32); old = audit_unpack_string(&bufp, &msglen, sizes[0]); if (IS_ERR(old)) { err = PTR_ERR(old); break; } new = audit_unpack_string(&bufp, &msglen, sizes[1]); if (IS_ERR(new)) { err = PTR_ERR(new); kfree(old); break; } /* OK, here comes... */ err = audit_tag_tree(old, new); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=make_equiv old="); audit_log_untrustedstring(ab, old); audit_log_format(ab, " new="); audit_log_untrustedstring(ab, new); audit_log_format(ab, " res=%d", !err); audit_log_end(ab); kfree(old); kfree(new); break; } case AUDIT_SIGNAL_INFO: if (lsmprop_is_set(&audit_sig_lsm)) { err = security_lsmprop_to_secctx(&audit_sig_lsm, &lsmctx); if (err < 0) return err; } sig_data = kmalloc(struct_size(sig_data, ctx, lsmctx.len), GFP_KERNEL); if (!sig_data) { if (lsmprop_is_set(&audit_sig_lsm)) security_release_secctx(&lsmctx); return -ENOMEM; } sig_data->uid = from_kuid(&init_user_ns, audit_sig_uid); sig_data->pid = audit_sig_pid; if (lsmprop_is_set(&audit_sig_lsm)) { memcpy(sig_data->ctx, lsmctx.context, lsmctx.len); security_release_secctx(&lsmctx); } audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO, 0, 0, sig_data, struct_size(sig_data, ctx, lsmctx.len)); kfree(sig_data); break; case AUDIT_TTY_GET: { struct audit_tty_status s; unsigned int t; t = READ_ONCE(current->signal->audit_tty); s.enabled = t & AUDIT_TTY_ENABLE; s.log_passwd = !!(t & AUDIT_TTY_LOG_PASSWD); audit_send_reply(skb, seq, AUDIT_TTY_GET, 0, 0, &s, sizeof(s)); break; } case AUDIT_TTY_SET: { struct audit_tty_status s, old; struct audit_buffer *ab; unsigned int t; memset(&s, 0, sizeof(s)); /* guard against past and future API changes */ memcpy(&s, data, min_t(size_t, sizeof(s), data_len)); /* check if new data is valid */ if ((s.enabled != 0 && s.enabled != 1) || (s.log_passwd != 0 && s.log_passwd != 1)) err = -EINVAL; if (err) t = READ_ONCE(current->signal->audit_tty); else { t = s.enabled | (-s.log_passwd & AUDIT_TTY_LOG_PASSWD); t = xchg(¤t->signal->audit_tty, t); } old.enabled = t & AUDIT_TTY_ENABLE; old.log_passwd = !!(t & AUDIT_TTY_LOG_PASSWD); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=tty_set old-enabled=%d new-enabled=%d" " old-log_passwd=%d new-log_passwd=%d res=%d", old.enabled, s.enabled, old.log_passwd, s.log_passwd, !err); audit_log_end(ab); break; } default: err = -EINVAL; break; } return err < 0 ? err : 0; } /** * audit_receive - receive messages from a netlink control socket * @skb: the message buffer * * Parse the provided skb and deal with any messages that may be present, * malformed skbs are discarded. */ static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; bool ack; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned */ int len; int err; nlh = nlmsg_hdr(skb); len = skb->len; audit_ctl_lock(); while (nlmsg_ok(nlh, len)) { ack = nlh->nlmsg_flags & NLM_F_ACK; err = audit_receive_msg(skb, nlh, &ack); /* send an ack if the user asked for one and audit_receive_msg * didn't already do it, or if there was an error. */ if (ack || err) netlink_ack(skb, nlh, err, NULL); nlh = nlmsg_next(nlh, &len); } audit_ctl_unlock(); /* can't block with the ctrl lock, so penalize the sender now */ if (audit_backlog_limit && (skb_queue_len(&audit_queue) > audit_backlog_limit)) { DECLARE_WAITQUEUE(wait, current); /* wake kauditd to try and flush the queue */ wake_up_interruptible(&kauditd_wait); add_wait_queue_exclusive(&audit_backlog_wait, &wait); set_current_state(TASK_UNINTERRUPTIBLE); schedule_timeout(audit_backlog_wait_time); remove_wait_queue(&audit_backlog_wait, &wait); } } /* Log information about who is connecting to the audit multicast socket */ static void audit_log_multicast(int group, const char *op, int err) { const struct cred *cred; struct tty_struct *tty; char comm[sizeof(current->comm)]; struct audit_buffer *ab; if (!audit_enabled) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_EVENT_LISTENER); if (!ab) return; cred = current_cred(); tty = audit_get_tty(); audit_log_format(ab, "pid=%u uid=%u auid=%u tty=%s ses=%u", task_tgid_nr(current), from_kuid(&init_user_ns, cred->uid), from_kuid(&init_user_ns, audit_get_loginuid(current)), tty ? tty_name(tty) : "(none)", audit_get_sessionid(current)); audit_put_tty(tty); audit_log_task_context(ab); /* subj= */ audit_log_format(ab, " comm="); audit_log_untrustedstring(ab, get_task_comm(comm, current)); audit_log_d_path_exe(ab, current->mm); /* exe= */ audit_log_format(ab, " nl-mcgrp=%d op=%s res=%d", group, op, !err); audit_log_end(ab); } /* Run custom bind function on netlink socket group connect or bind requests. */ static int audit_multicast_bind(struct net *net, int group) { int err = 0; if (!capable(CAP_AUDIT_READ)) err = -EPERM; audit_log_multicast(group, "connect", err); return err; } static void audit_multicast_unbind(struct net *net, int group) { audit_log_multicast(group, "disconnect", 0); } static int __net_init audit_net_init(struct net *net) { struct netlink_kernel_cfg cfg = { .input = audit_receive, .bind = audit_multicast_bind, .unbind = audit_multicast_unbind, .flags = NL_CFG_F_NONROOT_RECV, .groups = AUDIT_NLGRP_MAX, }; struct audit_net *aunet = net_generic(net, audit_net_id); aunet->sk = netlink_kernel_create(net, NETLINK_AUDIT, &cfg); if (aunet->sk == NULL) { audit_panic("cannot initialize netlink socket in namespace"); return -ENOMEM; } /* limit the timeout in case auditd is blocked/stopped */ aunet->sk->sk_sndtimeo = HZ / 10; return 0; } static void __net_exit audit_net_exit(struct net *net) { struct audit_net *aunet = net_generic(net, audit_net_id); /* NOTE: you would think that we would want to check the auditd * connection and potentially reset it here if it lives in this * namespace, but since the auditd connection tracking struct holds a * reference to this namespace (see auditd_set()) we are only ever * going to get here after that connection has been released */ netlink_kernel_release(aunet->sk); } static struct pernet_operations audit_net_ops __net_initdata = { .init = audit_net_init, .exit = audit_net_exit, .id = &audit_net_id, .size = sizeof(struct audit_net), }; /* Initialize audit support at boot time. */ static int __init audit_init(void) { int i; if (audit_initialized == AUDIT_DISABLED) return 0; audit_buffer_cache = KMEM_CACHE(audit_buffer, SLAB_PANIC); skb_queue_head_init(&audit_queue); skb_queue_head_init(&audit_retry_queue); skb_queue_head_init(&audit_hold_queue); for (i = 0; i < AUDIT_INODE_BUCKETS; i++) INIT_LIST_HEAD(&audit_inode_hash[i]); mutex_init(&audit_cmd_mutex.lock); audit_cmd_mutex.owner = NULL; pr_info("initializing netlink subsys (%s)\n", str_enabled_disabled(audit_default)); register_pernet_subsys(&audit_net_ops); audit_initialized = AUDIT_INITIALIZED; kauditd_task = kthread_run(kauditd_thread, NULL, "kauditd"); if (IS_ERR(kauditd_task)) { int err = PTR_ERR(kauditd_task); panic("audit: failed to start the kauditd thread (%d)\n", err); } audit_log(NULL, GFP_KERNEL, AUDIT_KERNEL, "state=initialized audit_enabled=%u res=1", audit_enabled); return 0; } postcore_initcall(audit_init); /* * Process kernel command-line parameter at boot time. * audit={0|off} or audit={1|on}. */ static int __init audit_enable(char *str) { if (!strcasecmp(str, "off") || !strcmp(str, "0")) audit_default = AUDIT_OFF; else if (!strcasecmp(str, "on") || !strcmp(str, "1")) audit_default = AUDIT_ON; else { pr_err("audit: invalid 'audit' parameter value (%s)\n", str); audit_default = AUDIT_ON; } if (audit_default == AUDIT_OFF) audit_initialized = AUDIT_DISABLED; if (audit_set_enabled(audit_default)) pr_err("audit: error setting audit state (%d)\n", audit_default); pr_info("%s\n", audit_default ? "enabled (after initialization)" : "disabled (until reboot)"); return 1; } __setup("audit=", audit_enable); /* Process kernel command-line parameter at boot time. * audit_backlog_limit=<n> */ static int __init audit_backlog_limit_set(char *str) { u32 audit_backlog_limit_arg; pr_info("audit_backlog_limit: "); if (kstrtouint(str, 0, &audit_backlog_limit_arg)) { pr_cont("using default of %u, unable to parse %s\n", audit_backlog_limit, str); return 1; } audit_backlog_limit = audit_backlog_limit_arg; pr_cont("%d\n", audit_backlog_limit); return 1; } __setup("audit_backlog_limit=", audit_backlog_limit_set); static void audit_buffer_free(struct audit_buffer *ab) { if (!ab) return; kfree_skb(ab->skb); kmem_cache_free(audit_buffer_cache, ab); } static struct audit_buffer *audit_buffer_alloc(struct audit_context *ctx, gfp_t gfp_mask, int type) { struct audit_buffer *ab; ab = kmem_cache_alloc(audit_buffer_cache, gfp_mask); if (!ab) return NULL; ab->skb = nlmsg_new(AUDIT_BUFSIZ, gfp_mask); if (!ab->skb) goto err; if (!nlmsg_put(ab->skb, 0, 0, type, 0, 0)) goto err; ab->ctx = ctx; ab->gfp_mask = gfp_mask; return ab; err: audit_buffer_free(ab); return NULL; } /** * audit_serial - compute a serial number for the audit record * * Compute a serial number for the audit record. Audit records are * written to user-space as soon as they are generated, so a complete * audit record may be written in several pieces. The timestamp of the * record and this serial number are used by the user-space tools to * determine which pieces belong to the same audit record. The * (timestamp,serial) tuple is unique for each syscall and is live from * syscall entry to syscall exit. * * NOTE: Another possibility is to store the formatted records off the * audit context (for those records that have a context), and emit them * all at syscall exit. However, this could delay the reporting of * significant errors until syscall exit (or never, if the system * halts). */ unsigned int audit_serial(void) { static atomic_t serial = ATOMIC_INIT(0); return atomic_inc_return(&serial); } static inline void audit_get_stamp(struct audit_context *ctx, struct timespec64 *t, unsigned int *serial) { if (!ctx || !auditsc_get_stamp(ctx, t, serial)) { ktime_get_coarse_real_ts64(t); *serial = audit_serial(); } } /** * audit_log_start - obtain an audit buffer * @ctx: audit_context (may be NULL) * @gfp_mask: type of allocation * @type: audit message type * * Returns audit_buffer pointer on success or NULL on error. * * Obtain an audit buffer. This routine does locking to obtain the * audit buffer, but then no locking is required for calls to * audit_log_*format. If the task (ctx) is a task that is currently in a * syscall, then the syscall is marked as auditable and an audit record * will be written at syscall exit. If there is no associated task, then * task context (ctx) should be NULL. */ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, int type) { struct audit_buffer *ab; struct timespec64 t; unsigned int serial; if (audit_initialized != AUDIT_INITIALIZED) return NULL; if (unlikely(!audit_filter(type, AUDIT_FILTER_EXCLUDE))) return NULL; /* NOTE: don't ever fail/sleep on these two conditions: * 1. auditd generated record - since we need auditd to drain the * queue; also, when we are checking for auditd, compare PIDs using * task_tgid_vnr() since auditd_pid is set in audit_receive_msg() * using a PID anchored in the caller's namespace * 2. generator holding the audit_cmd_mutex - we don't want to block * while holding the mutex, although we do penalize the sender * later in audit_receive() when it is safe to block */ if (!(auditd_test_task(current) || audit_ctl_owner_current())) { long stime = audit_backlog_wait_time; while (audit_backlog_limit && (skb_queue_len(&audit_queue) > audit_backlog_limit)) { /* wake kauditd to try and flush the queue */ wake_up_interruptible(&kauditd_wait); /* sleep if we are allowed and we haven't exhausted our * backlog wait limit */ if (gfpflags_allow_blocking(gfp_mask) && (stime > 0)) { long rtime = stime; DECLARE_WAITQUEUE(wait, current); add_wait_queue_exclusive(&audit_backlog_wait, &wait); set_current_state(TASK_UNINTERRUPTIBLE); stime = schedule_timeout(rtime); atomic_add(rtime - stime, &audit_backlog_wait_time_actual); remove_wait_queue(&audit_backlog_wait, &wait); } else { if (audit_rate_check() && printk_ratelimit()) pr_warn("audit_backlog=%d > audit_backlog_limit=%d\n", skb_queue_len(&audit_queue), audit_backlog_limit); audit_log_lost("backlog limit exceeded"); return NULL; } } } ab = audit_buffer_alloc(ctx, gfp_mask, type); if (!ab) { audit_log_lost("out of memory in audit_log_start"); return NULL; } audit_get_stamp(ab->ctx, &t, &serial); /* cancel dummy context to enable supporting records */ if (ctx) ctx->dummy = 0; audit_log_format(ab, "audit(%llu.%03lu:%u): ", (unsigned long long)t.tv_sec, t.tv_nsec/1000000, serial); return ab; } /** * audit_expand - expand skb in the audit buffer * @ab: audit_buffer * @extra: space to add at tail of the skb * * Returns 0 (no space) on failed expansion, or available space if * successful. */ static inline int audit_expand(struct audit_buffer *ab, int extra) { struct sk_buff *skb = ab->skb; int oldtail = skb_tailroom(skb); int ret = pskb_expand_head(skb, 0, extra, ab->gfp_mask); int newtail = skb_tailroom(skb); if (ret < 0) { audit_log_lost("out of memory in audit_expand"); return 0; } skb->truesize += newtail - oldtail; return newtail; } /* * Format an audit message into the audit buffer. If there isn't enough * room in the audit buffer, more room will be allocated and vsnprint * will be called a second time. Currently, we assume that a printk * can't format message larger than 1024 bytes, so we don't either. */ static __printf(2, 0) void audit_log_vformat(struct audit_buffer *ab, const char *fmt, va_list args) { int len, avail; struct sk_buff *skb; va_list args2; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); if (avail == 0) { avail = audit_expand(ab, AUDIT_BUFSIZ); if (!avail) goto out; } va_copy(args2, args); len = vsnprintf(skb_tail_pointer(skb), avail, fmt, args); if (len >= avail) { /* The printk buffer is 1024 bytes long, so if we get * here and AUDIT_BUFSIZ is at least 1024, then we can * log everything that printk could have logged. */ avail = audit_expand(ab, max_t(unsigned, AUDIT_BUFSIZ, 1+len-avail)); if (!avail) goto out_va_end; len = vsnprintf(skb_tail_pointer(skb), avail, fmt, args2); } if (len > 0) skb_put(skb, len); out_va_end: va_end(args2); out: return; } /** * audit_log_format - format a message into the audit buffer. * @ab: audit_buffer * @fmt: format string * @...: optional parameters matching @fmt string * * All the work is done in audit_log_vformat. */ void audit_log_format(struct audit_buffer *ab, const char *fmt, ...) { va_list args; if (!ab) return; va_start(args, fmt); audit_log_vformat(ab, fmt, args); va_end(args); } /** * audit_log_n_hex - convert a buffer to hex and append it to the audit skb * @ab: the audit_buffer * @buf: buffer to convert to hex * @len: length of @buf to be converted * * No return value; failure to expand is silently ignored. * * This function will take the passed buf and convert it into a string of * ascii hex digits. The new string is placed onto the skb. */ void audit_log_n_hex(struct audit_buffer *ab, const unsigned char *buf, size_t len) { int i, avail, new_len; unsigned char *ptr; struct sk_buff *skb; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); new_len = len<<1; if (new_len >= avail) { /* Round the buffer request up to the next multiple */ new_len = AUDIT_BUFSIZ*(((new_len-avail)/AUDIT_BUFSIZ) + 1); avail = audit_expand(ab, new_len); if (!avail) return; } ptr = skb_tail_pointer(skb); for (i = 0; i < len; i++) ptr = hex_byte_pack_upper(ptr, buf[i]); *ptr = 0; skb_put(skb, len << 1); /* new string is twice the old string */ } /* * Format a string of no more than slen characters into the audit buffer, * enclosed in quote marks. */ void audit_log_n_string(struct audit_buffer *ab, const char *string, size_t slen) { int avail, new_len; unsigned char *ptr; struct sk_buff *skb; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); new_len = slen + 3; /* enclosing quotes + null terminator */ if (new_len > avail) { avail = audit_expand(ab, new_len); if (!avail) return; } ptr = skb_tail_pointer(skb); *ptr++ = '"'; memcpy(ptr, string, slen); ptr += slen; *ptr++ = '"'; *ptr = 0; skb_put(skb, slen + 2); /* don't include null terminator */ } /** * audit_string_contains_control - does a string need to be logged in hex * @string: string to be checked * @len: max length of the string to check */ bool audit_string_contains_control(const char *string, size_t len) { const unsigned char *p; for (p = string; p < (const unsigned char *)string + len; p++) { if (*p == '"' || *p < 0x21 || *p > 0x7e) return true; } return false; } /** * audit_log_n_untrustedstring - log a string that may contain random characters * @ab: audit_buffer * @string: string to be logged * @len: length of string (not including trailing null) * * This code will escape a string that is passed to it if the string * contains a control character, unprintable character, double quote mark, * or a space. Unescaped strings will start and end with a double quote mark. * Strings that are escaped are printed in hex (2 digits per char). * * The caller specifies the number of characters in the string to log, which may * or may not be the entire string. */ void audit_log_n_untrustedstring(struct audit_buffer *ab, const char *string, size_t len) { if (audit_string_contains_control(string, len)) audit_log_n_hex(ab, string, len); else audit_log_n_string(ab, string, len); } /** * audit_log_untrustedstring - log a string that may contain random characters * @ab: audit_buffer * @string: string to be logged * * Same as audit_log_n_untrustedstring(), except that strlen is used to * determine string length. */ void audit_log_untrustedstring(struct audit_buffer *ab, const char *string) { audit_log_n_untrustedstring(ab, string, strlen(string)); } /* This is a helper-function to print the escaped d_path */ void audit_log_d_path(struct audit_buffer *ab, const char *prefix, const struct path *path) { char *p, *pathname; if (prefix) audit_log_format(ab, "%s", prefix); /* We will allow 11 spaces for ' (deleted)' to be appended */ pathname = kmalloc(PATH_MAX+11, ab->gfp_mask); if (!pathname) { audit_log_format(ab, "\"<no_memory>\""); return; } p = d_path(path, pathname, PATH_MAX+11); if (IS_ERR(p)) { /* Should never happen since we send PATH_MAX */ /* FIXME: can we save some information here? */ audit_log_format(ab, "\"<too_long>\""); } else audit_log_untrustedstring(ab, p); kfree(pathname); } void audit_log_session_info(struct audit_buffer *ab) { unsigned int sessionid = audit_get_sessionid(current); uid_t auid = from_kuid(&init_user_ns, audit_get_loginuid(current)); audit_log_format(ab, "auid=%u ses=%u", auid, sessionid); } void audit_log_key(struct audit_buffer *ab, char *key) { audit_log_format(ab, " key="); if (key) audit_log_untrustedstring(ab, key); else audit_log_format(ab, "(null)"); } int audit_log_task_context(struct audit_buffer *ab) { struct lsm_prop prop; struct lsm_context ctx; int error; security_current_getlsmprop_subj(&prop); if (!lsmprop_is_set(&prop)) return 0; error = security_lsmprop_to_secctx(&prop, &ctx); if (error < 0) { if (error != -EINVAL) goto error_path; return 0; } audit_log_format(ab, " subj=%s", ctx.context); security_release_secctx(&ctx); return 0; error_path: audit_panic("error in audit_log_task_context"); return error; } EXPORT_SYMBOL(audit_log_task_context); void audit_log_d_path_exe(struct audit_buffer *ab, struct mm_struct *mm) { struct file *exe_file; if (!mm) goto out_null; exe_file = get_mm_exe_file(mm); if (!exe_file) goto out_null; audit_log_d_path(ab, " exe=", &exe_file->f_path); fput(exe_file); return; out_null: audit_log_format(ab, " exe=(null)"); } struct tty_struct *audit_get_tty(void) { struct tty_struct *tty = NULL; unsigned long flags; spin_lock_irqsave(¤t->sighand->siglock, flags); if (current->signal) tty = tty_kref_get(current->signal->tty); spin_unlock_irqrestore(¤t->sighand->siglock, flags); return tty; } void audit_put_tty(struct tty_struct *tty) { tty_kref_put(tty); } void audit_log_task_info(struct audit_buffer *ab) { const struct cred *cred; char comm[sizeof(current->comm)]; struct tty_struct *tty; if (!ab) return; cred = current_cred(); tty = audit_get_tty(); audit_log_format(ab, " ppid=%d pid=%d auid=%u uid=%u gid=%u" " euid=%u suid=%u fsuid=%u" " egid=%u sgid=%u fsgid=%u tty=%s ses=%u", task_ppid_nr(current), task_tgid_nr(current), from_kuid(&init_user_ns, audit_get_loginuid(current)), from_kuid(&init_user_ns, cred->uid), from_kgid(&init_user_ns, cred->gid), from_kuid(&init_user_ns, cred->euid), from_kuid(&init_user_ns, cred->suid), from_kuid(&init_user_ns, cred->fsuid), from_kgid(&init_user_ns, cred->egid), from_kgid(&init_user_ns, cred->sgid), from_kgid(&init_user_ns, cred->fsgid), tty ? tty_name(tty) : "(none)", audit_get_sessionid(current)); audit_put_tty(tty); audit_log_format(ab, " comm="); audit_log_untrustedstring(ab, get_task_comm(comm, current)); audit_log_d_path_exe(ab, current->mm); audit_log_task_context(ab); } EXPORT_SYMBOL(audit_log_task_info); /** * audit_log_path_denied - report a path restriction denial * @type: audit message type (AUDIT_ANOM_LINK, AUDIT_ANOM_CREAT, etc) * @operation: specific operation name */ void audit_log_path_denied(int type, const char *operation) { struct audit_buffer *ab; if (!audit_enabled) return; /* Generate log with subject, operation, outcome. */ ab = audit_log_start(audit_context(), GFP_KERNEL, type); if (!ab) return; audit_log_format(ab, "op=%s", operation); audit_log_task_info(ab); audit_log_format(ab, " res=0"); audit_log_end(ab); } /* global counter which is incremented every time something logs in */ static atomic_t session_id = ATOMIC_INIT(0); static int audit_set_loginuid_perm(kuid_t loginuid) { /* if we are unset, we don't need privs */ if (!audit_loginuid_set(current)) return 0; /* if AUDIT_FEATURE_LOGINUID_IMMUTABLE means never ever allow a change*/ if (is_audit_feature_set(AUDIT_FEATURE_LOGINUID_IMMUTABLE)) return -EPERM; /* it is set, you need permission */ if (!capable(CAP_AUDIT_CONTROL)) return -EPERM; /* reject if this is not an unset and we don't allow that */ if (is_audit_feature_set(AUDIT_FEATURE_ONLY_UNSET_LOGINUID) && uid_valid(loginuid)) return -EPERM; return 0; } static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, unsigned int oldsessionid, unsigned int sessionid, int rc) { struct audit_buffer *ab; uid_t uid, oldloginuid, loginuid; struct tty_struct *tty; if (!audit_enabled) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_LOGIN); if (!ab) return; uid = from_kuid(&init_user_ns, task_uid(current)); oldloginuid = from_kuid(&init_user_ns, koldloginuid); loginuid = from_kuid(&init_user_ns, kloginuid); tty = audit_get_tty(); audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid); audit_log_task_context(ab); audit_log_format(ab, " old-auid=%u auid=%u tty=%s old-ses=%u ses=%u res=%d", oldloginuid, loginuid, tty ? tty_name(tty) : "(none)", oldsessionid, sessionid, !rc); audit_put_tty(tty); audit_log_end(ab); } /** * audit_set_loginuid - set current task's loginuid * @loginuid: loginuid value * * Returns 0. * * Called (set) from fs/proc/base.c::proc_loginuid_write(). */ int audit_set_loginuid(kuid_t loginuid) { unsigned int oldsessionid, sessionid = AUDIT_SID_UNSET; kuid_t oldloginuid; int rc; oldloginuid = audit_get_loginuid(current); oldsessionid = audit_get_sessionid(current); rc = audit_set_loginuid_perm(loginuid); if (rc) goto out; /* are we setting or clearing? */ if (uid_valid(loginuid)) { sessionid = (unsigned int)atomic_inc_return(&session_id); if (unlikely(sessionid == AUDIT_SID_UNSET)) sessionid = (unsigned int)atomic_inc_return(&session_id); } current->sessionid = sessionid; current->loginuid = loginuid; out: audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc); return rc; } /** * audit_signal_info - record signal info for shutting down audit subsystem * @sig: signal value * @t: task being signaled * * If the audit subsystem is being terminated, record the task (pid) * and uid that is doing that. */ int audit_signal_info(int sig, struct task_struct *t) { kuid_t uid = current_uid(), auid; if (auditd_test_task(t) && (sig == SIGTERM || sig == SIGHUP || sig == SIGUSR1 || sig == SIGUSR2)) { audit_sig_pid = task_tgid_nr(current); auid = audit_get_loginuid(current); if (uid_valid(auid)) audit_sig_uid = auid; else audit_sig_uid = uid; security_current_getlsmprop_subj(&audit_sig_lsm); } return audit_signal_info_syscall(t); } /** * audit_log_end - end one audit record * @ab: the audit_buffer * * We can not do a netlink send inside an irq context because it blocks (last * arg, flags, is not set to MSG_DONTWAIT), so the audit buffer is placed on a * queue and a kthread is scheduled to remove them from the queue outside the * irq context. May be called in any context. */ void audit_log_end(struct audit_buffer *ab) { struct sk_buff *skb; struct nlmsghdr *nlh; if (!ab) return; if (audit_rate_check()) { skb = ab->skb; ab->skb = NULL; /* setup the netlink header, see the comments in * kauditd_send_multicast_skb() for length quirks */ nlh = nlmsg_hdr(skb); nlh->nlmsg_len = skb->len - NLMSG_HDRLEN; /* queue the netlink packet and poke the kauditd thread */ skb_queue_tail(&audit_queue, skb); wake_up_interruptible(&kauditd_wait); } else audit_log_lost("rate limit exceeded"); audit_buffer_free(ab); } /** * audit_log - Log an audit record * @ctx: audit context * @gfp_mask: type of allocation * @type: audit message type * @fmt: format string to use * @...: variable parameters matching the format string * * This is a convenience function that calls audit_log_start, * audit_log_vformat, and audit_log_end. It may be called * in any context. */ void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type, const char *fmt, ...) { struct audit_buffer *ab; va_list args; ab = audit_log_start(ctx, gfp_mask, type); if (ab) { va_start(args, fmt); audit_log_vformat(ab, fmt, args); va_end(args); audit_log_end(ab); } } EXPORT_SYMBOL(audit_log_start); EXPORT_SYMBOL(audit_log_end); EXPORT_SYMBOL(audit_log_format); EXPORT_SYMBOL(audit_log); |
| 39 521 52 92 193 72 709 279 140 2 307 7 437 340 328 2 145 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 | /* * net/tipc/trace.h: TIPC tracepoints * * Copyright (c) 2018, Ericsson AB * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the names of the copyright holders nor the names of its * contributors may be used to endorse or promote products derived from * this software without specific prior written permission. * * Alternatively, this software may be distributed under the terms of the * GNU General Public License ("GPL") version 2 as published by the Free * Software Foundation. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "ASIS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ #undef TRACE_SYSTEM #define TRACE_SYSTEM tipc #if !defined(_TIPC_TRACE_H) || defined(TRACE_HEADER_MULTI_READ) #define _TIPC_TRACE_H #include <linux/tracepoint.h> #include "core.h" #include "link.h" #include "socket.h" #include "node.h" #define SKB_LMIN (100) #define SKB_LMAX (SKB_LMIN * 2) #define LIST_LMIN (SKB_LMIN * 3) #define LIST_LMAX (SKB_LMIN * 11) #define SK_LMIN (SKB_LMIN * 2) #define SK_LMAX (SKB_LMIN * 11) #define LINK_LMIN (SKB_LMIN) #define LINK_LMAX (SKB_LMIN * 16) #define NODE_LMIN (SKB_LMIN) #define NODE_LMAX (SKB_LMIN * 11) #ifndef __TIPC_TRACE_ENUM #define __TIPC_TRACE_ENUM enum { TIPC_DUMP_NONE = 0, TIPC_DUMP_TRANSMQ = 1, TIPC_DUMP_BACKLOGQ = (1 << 1), TIPC_DUMP_DEFERDQ = (1 << 2), TIPC_DUMP_INPUTQ = (1 << 3), TIPC_DUMP_WAKEUP = (1 << 4), TIPC_DUMP_SK_SNDQ = (1 << 8), TIPC_DUMP_SK_RCVQ = (1 << 9), TIPC_DUMP_SK_BKLGQ = (1 << 10), TIPC_DUMP_ALL = 0xffffu }; #endif /* Link & Node FSM states: */ #define state_sym(val) \ __print_symbolic(val, \ {(0xe), "ESTABLISHED" },\ {(0xe << 4), "ESTABLISHING" },\ {(0x1 << 8), "RESET" },\ {(0x2 << 12), "RESETTING" },\ {(0xd << 16), "PEER_RESET" },\ {(0xf << 20), "FAILINGOVER" },\ {(0xc << 24), "SYNCHING" },\ {(0xdd), "SELF_DOWN_PEER_DOWN" },\ {(0xaa), "SELF_UP_PEER_UP" },\ {(0xd1), "SELF_DOWN_PEER_LEAVING" },\ {(0xac), "SELF_UP_PEER_COMING" },\ {(0xca), "SELF_COMING_PEER_UP" },\ {(0x1d), "SELF_LEAVING_PEER_DOWN" },\ {(0xf0), "FAILINGOVER" },\ {(0xcc), "SYNCHING" }) /* Link & Node FSM events: */ #define evt_sym(val) \ __print_symbolic(val, \ {(0xec1ab1e), "ESTABLISH_EVT" },\ {(0x9eed0e), "PEER_RESET_EVT" },\ {(0xfa110e), "FAILURE_EVT" },\ {(0x10ca1d0e), "RESET_EVT" },\ {(0xfa110bee), "FAILOVER_BEGIN_EVT" },\ {(0xfa110ede), "FAILOVER_END_EVT" },\ {(0xc1ccbee), "SYNCH_BEGIN_EVT" },\ {(0xc1ccede), "SYNCH_END_EVT" },\ {(0xece), "SELF_ESTABL_CONTACT_EVT" },\ {(0x1ce), "SELF_LOST_CONTACT_EVT" },\ {(0x9ece), "PEER_ESTABL_CONTACT_EVT" },\ {(0x91ce), "PEER_LOST_CONTACT_EVT" },\ {(0xfbe), "FAILOVER_BEGIN_EVT" },\ {(0xfee), "FAILOVER_END_EVT" },\ {(0xcbe), "SYNCH_BEGIN_EVT" },\ {(0xcee), "SYNCH_END_EVT" }) /* Bearer, net device events: */ #define dev_evt_sym(val) \ __print_symbolic(val, \ {(NETDEV_CHANGE), "NETDEV_CHANGE" },\ {(NETDEV_GOING_DOWN), "NETDEV_GOING_DOWN" },\ {(NETDEV_UP), "NETDEV_UP" },\ {(NETDEV_CHANGEMTU), "NETDEV_CHANGEMTU" },\ {(NETDEV_CHANGEADDR), "NETDEV_CHANGEADDR" },\ {(NETDEV_UNREGISTER), "NETDEV_UNREGISTER" },\ {(NETDEV_CHANGENAME), "NETDEV_CHANGENAME" }) extern unsigned long sysctl_tipc_sk_filter[5] __read_mostly; int tipc_skb_dump(struct sk_buff *skb, bool more, char *buf); int tipc_list_dump(struct sk_buff_head *list, bool more, char *buf); int tipc_sk_dump(struct sock *sk, u16 dqueues, char *buf); int tipc_link_dump(struct tipc_link *l, u16 dqueues, char *buf); int tipc_node_dump(struct tipc_node *n, bool more, char *buf); bool tipc_sk_filtering(struct sock *sk); DECLARE_EVENT_CLASS(tipc_skb_class, TP_PROTO(struct sk_buff *skb, bool more, const char *header), TP_ARGS(skb, more, header), TP_STRUCT__entry( __string(header, header) __dynamic_array(char, buf, (more) ? SKB_LMAX : SKB_LMIN) ), TP_fast_assign( __assign_str(header); tipc_skb_dump(skb, more, __get_str(buf)); ), TP_printk("%s\n%s", __get_str(header), __get_str(buf)) ) #define DEFINE_SKB_EVENT(name) \ DEFINE_EVENT(tipc_skb_class, name, \ TP_PROTO(struct sk_buff *skb, bool more, const char *header), \ TP_ARGS(skb, more, header)) DEFINE_SKB_EVENT(tipc_skb_dump); DEFINE_SKB_EVENT(tipc_proto_build); DEFINE_SKB_EVENT(tipc_proto_rcv); DECLARE_EVENT_CLASS(tipc_list_class, TP_PROTO(struct sk_buff_head *list, bool more, const char *header), TP_ARGS(list, more, header), TP_STRUCT__entry( __string(header, header) __dynamic_array(char, buf, (more) ? LIST_LMAX : LIST_LMIN) ), TP_fast_assign( __assign_str(header); tipc_list_dump(list, more, __get_str(buf)); ), TP_printk("%s\n%s", __get_str(header), __get_str(buf)) ); #define DEFINE_LIST_EVENT(name) \ DEFINE_EVENT(tipc_list_class, name, \ TP_PROTO(struct sk_buff_head *list, bool more, const char *header), \ TP_ARGS(list, more, header)) DEFINE_LIST_EVENT(tipc_list_dump); DECLARE_EVENT_CLASS(tipc_sk_class, TP_PROTO(struct sock *sk, struct sk_buff *skb, u16 dqueues, const char *header), TP_ARGS(sk, skb, dqueues, header), TP_STRUCT__entry( __string(header, header) __field(u32, portid) __dynamic_array(char, buf, (dqueues) ? SK_LMAX : SK_LMIN) __dynamic_array(char, skb_buf, (skb) ? SKB_LMIN : 1) ), TP_fast_assign( __assign_str(header); __entry->portid = tipc_sock_get_portid(sk); tipc_sk_dump(sk, dqueues, __get_str(buf)); if (skb) tipc_skb_dump(skb, false, __get_str(skb_buf)); else *(__get_str(skb_buf)) = '\0'; ), TP_printk("<%u> %s\n%s%s", __entry->portid, __get_str(header), __get_str(skb_buf), __get_str(buf)) ); #define DEFINE_SK_EVENT_FILTER(name) \ DEFINE_EVENT_CONDITION(tipc_sk_class, name, \ TP_PROTO(struct sock *sk, struct sk_buff *skb, u16 dqueues, \ const char *header), \ TP_ARGS(sk, skb, dqueues, header), \ TP_CONDITION(tipc_sk_filtering(sk))) DEFINE_SK_EVENT_FILTER(tipc_sk_dump); DEFINE_SK_EVENT_FILTER(tipc_sk_create); DEFINE_SK_EVENT_FILTER(tipc_sk_sendmcast); DEFINE_SK_EVENT_FILTER(tipc_sk_sendmsg); DEFINE_SK_EVENT_FILTER(tipc_sk_sendstream); DEFINE_SK_EVENT_FILTER(tipc_sk_poll); DEFINE_SK_EVENT_FILTER(tipc_sk_filter_rcv); DEFINE_SK_EVENT_FILTER(tipc_sk_advance_rx); DEFINE_SK_EVENT_FILTER(tipc_sk_rej_msg); DEFINE_SK_EVENT_FILTER(tipc_sk_drop_msg); DEFINE_SK_EVENT_FILTER(tipc_sk_release); DEFINE_SK_EVENT_FILTER(tipc_sk_shutdown); #define DEFINE_SK_EVENT_FILTER_COND(name, cond) \ DEFINE_EVENT_CONDITION(tipc_sk_class, name, \ TP_PROTO(struct sock *sk, struct sk_buff *skb, u16 dqueues, \ const char *header), \ TP_ARGS(sk, skb, dqueues, header), \ TP_CONDITION(tipc_sk_filtering(sk) && (cond))) DEFINE_SK_EVENT_FILTER_COND(tipc_sk_overlimit1, tipc_sk_overlimit1(sk, skb)); DEFINE_SK_EVENT_FILTER_COND(tipc_sk_overlimit2, tipc_sk_overlimit2(sk, skb)); DECLARE_EVENT_CLASS(tipc_link_class, TP_PROTO(struct tipc_link *l, u16 dqueues, const char *header), TP_ARGS(l, dqueues, header), TP_STRUCT__entry( __string(header, header) __array(char, name, TIPC_MAX_LINK_NAME) __dynamic_array(char, buf, (dqueues) ? LINK_LMAX : LINK_LMIN) ), TP_fast_assign( __assign_str(header); memcpy(__entry->name, tipc_link_name(l), TIPC_MAX_LINK_NAME); tipc_link_dump(l, dqueues, __get_str(buf)); ), TP_printk("<%s> %s\n%s", __entry->name, __get_str(header), __get_str(buf)) ); #define DEFINE_LINK_EVENT(name) \ DEFINE_EVENT(tipc_link_class, name, \ TP_PROTO(struct tipc_link *l, u16 dqueues, const char *header), \ TP_ARGS(l, dqueues, header)) DEFINE_LINK_EVENT(tipc_link_dump); DEFINE_LINK_EVENT(tipc_link_conges); DEFINE_LINK_EVENT(tipc_link_timeout); DEFINE_LINK_EVENT(tipc_link_reset); #define DEFINE_LINK_EVENT_COND(name, cond) \ DEFINE_EVENT_CONDITION(tipc_link_class, name, \ TP_PROTO(struct tipc_link *l, u16 dqueues, const char *header), \ TP_ARGS(l, dqueues, header), \ TP_CONDITION(cond)) DEFINE_LINK_EVENT_COND(tipc_link_too_silent, tipc_link_too_silent(l)); DECLARE_EVENT_CLASS(tipc_link_transmq_class, TP_PROTO(struct tipc_link *r, u16 f, u16 t, struct sk_buff_head *tq), TP_ARGS(r, f, t, tq), TP_STRUCT__entry( __array(char, name, TIPC_MAX_LINK_NAME) __field(u16, from) __field(u16, to) __field(u32, len) __field(u16, fseqno) __field(u16, lseqno) ), TP_fast_assign( memcpy(__entry->name, tipc_link_name(r), TIPC_MAX_LINK_NAME); __entry->from = f; __entry->to = t; __entry->len = skb_queue_len(tq); __entry->fseqno = __entry->len ? msg_seqno(buf_msg(skb_peek(tq))) : 0; __entry->lseqno = __entry->len ? msg_seqno(buf_msg(skb_peek_tail(tq))) : 0; ), TP_printk("<%s> retrans req: [%u-%u] transmq: %u [%u-%u]\n", __entry->name, __entry->from, __entry->to, __entry->len, __entry->fseqno, __entry->lseqno) ); DEFINE_EVENT_CONDITION(tipc_link_transmq_class, tipc_link_retrans, TP_PROTO(struct tipc_link *r, u16 f, u16 t, struct sk_buff_head *tq), TP_ARGS(r, f, t, tq), TP_CONDITION(less_eq(f, t)) ); DEFINE_EVENT_PRINT(tipc_link_transmq_class, tipc_link_bc_ack, TP_PROTO(struct tipc_link *r, u16 f, u16 t, struct sk_buff_head *tq), TP_ARGS(r, f, t, tq), TP_printk("<%s> acked: %u gap: %u transmq: %u [%u-%u]\n", __entry->name, __entry->from, __entry->to, __entry->len, __entry->fseqno, __entry->lseqno) ); DECLARE_EVENT_CLASS(tipc_node_class, TP_PROTO(struct tipc_node *n, bool more, const char *header), TP_ARGS(n, more, header), TP_STRUCT__entry( __string(header, header) __field(u32, addr) __dynamic_array(char, buf, (more) ? NODE_LMAX : NODE_LMIN) ), TP_fast_assign( __assign_str(header); __entry->addr = tipc_node_get_addr(n); tipc_node_dump(n, more, __get_str(buf)); ), TP_printk("<%x> %s\n%s", __entry->addr, __get_str(header), __get_str(buf)) ); #define DEFINE_NODE_EVENT(name) \ DEFINE_EVENT(tipc_node_class, name, \ TP_PROTO(struct tipc_node *n, bool more, const char *header), \ TP_ARGS(n, more, header)) DEFINE_NODE_EVENT(tipc_node_dump); DEFINE_NODE_EVENT(tipc_node_create); DEFINE_NODE_EVENT(tipc_node_delete); DEFINE_NODE_EVENT(tipc_node_lost_contact); DEFINE_NODE_EVENT(tipc_node_timeout); DEFINE_NODE_EVENT(tipc_node_link_up); DEFINE_NODE_EVENT(tipc_node_link_down); DEFINE_NODE_EVENT(tipc_node_reset_links); DEFINE_NODE_EVENT(tipc_node_check_state); DECLARE_EVENT_CLASS(tipc_fsm_class, TP_PROTO(const char *name, u32 os, u32 ns, int evt), TP_ARGS(name, os, ns, evt), TP_STRUCT__entry( __string(name, name) __field(u32, os) __field(u32, ns) __field(u32, evt) ), TP_fast_assign( __assign_str(name); __entry->os = os; __entry->ns = ns; __entry->evt = evt; ), TP_printk("<%s> %s--(%s)->%s\n", __get_str(name), state_sym(__entry->os), evt_sym(__entry->evt), state_sym(__entry->ns)) ); #define DEFINE_FSM_EVENT(fsm_name) \ DEFINE_EVENT(tipc_fsm_class, fsm_name, \ TP_PROTO(const char *name, u32 os, u32 ns, int evt), \ TP_ARGS(name, os, ns, evt)) DEFINE_FSM_EVENT(tipc_link_fsm); DEFINE_FSM_EVENT(tipc_node_fsm); TRACE_EVENT(tipc_l2_device_event, TP_PROTO(struct net_device *dev, struct tipc_bearer *b, unsigned long evt), TP_ARGS(dev, b, evt), TP_STRUCT__entry( __string(dev_name, dev->name) __string(b_name, b->name) __field(unsigned long, evt) __field(u8, b_up) __field(u8, carrier) __field(u8, oper) ), TP_fast_assign( __assign_str(dev_name); __assign_str(b_name); __entry->evt = evt; __entry->b_up = test_bit(0, &b->up); __entry->carrier = netif_carrier_ok(dev); __entry->oper = netif_oper_up(dev); ), TP_printk("%s on: <%s>/<%s> oper: %s carrier: %s bearer: %s\n", dev_evt_sym(__entry->evt), __get_str(dev_name), __get_str(b_name), (__entry->oper) ? "up" : "down", (__entry->carrier) ? "ok" : "notok", (__entry->b_up) ? "up" : "down") ); #endif /* _TIPC_TRACE_H */ /* This part must be outside protection */ #undef TRACE_INCLUDE_PATH #define TRACE_INCLUDE_PATH . #undef TRACE_INCLUDE_FILE #define TRACE_INCLUDE_FILE trace #include <trace/define_trace.h> |
| 31 31 57 73 26 22 4 23 23 23 16 47 47 47 52 35 35 35 6 5 1 1 5 43 16 27 1 28 28 28 28 28 4 15 11 1 15 12 27 3 1 1 37 10 4 6 5 4 2 2 4 6 3 2 73 73 47 26 81 40 47 3 3 3 3 3 2 2 2 2 2 1 1 1 1 1 41 1 41 19 21 27 25 2 27 7 23 53 53 1 23 31 27 31 17 14 14 27 4 11 2 9 7 6 2 4 90 45 52 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright(c) 1999 - 2004 Intel Corporation. All rights reserved. */ #include <linux/skbuff.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/pkt_sched.h> #include <linux/spinlock.h> #include <linux/slab.h> #include <linux/timer.h> #include <linux/ip.h> #include <linux/ipv6.h> #include <linux/if_arp.h> #include <linux/if_ether.h> #include <linux/if_bonding.h> #include <linux/if_vlan.h> #include <linux/in.h> #include <net/arp.h> #include <net/ipv6.h> #include <net/ndisc.h> #include <asm/byteorder.h> #include <net/bonding.h> #include <net/bond_alb.h> static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = { 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 }; static const int alb_delta_in_ticks = HZ / ALB_TIMER_TICKS_PER_SEC; #pragma pack(1) struct learning_pkt { u8 mac_dst[ETH_ALEN]; u8 mac_src[ETH_ALEN]; __be16 type; u8 padding[ETH_ZLEN - ETH_HLEN]; }; struct arp_pkt { __be16 hw_addr_space; __be16 prot_addr_space; u8 hw_addr_len; u8 prot_addr_len; __be16 op_code; u8 mac_src[ETH_ALEN]; /* sender hardware address */ __be32 ip_src; /* sender IP address */ u8 mac_dst[ETH_ALEN]; /* target hardware address */ __be32 ip_dst; /* target IP address */ }; #pragma pack() /* Forward declaration */ static void alb_send_learning_packets(struct slave *slave, const u8 mac_addr[], bool strict_match); static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp); static void rlb_src_unlink(struct bonding *bond, u32 index); static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash); static inline u8 _simple_hash(const u8 *hash_start, int hash_size) { int i; u8 hash = 0; for (i = 0; i < hash_size; i++) hash ^= hash_start[i]; return hash; } /*********************** tlb specific functions ***************************/ static inline void tlb_init_table_entry(struct tlb_client_info *entry, int save_load) { if (save_load) { entry->load_history = 1 + entry->tx_bytes / BOND_TLB_REBALANCE_INTERVAL; entry->tx_bytes = 0; } entry->tx_slave = NULL; entry->next = TLB_NULL_INDEX; entry->prev = TLB_NULL_INDEX; } static inline void tlb_init_slave(struct slave *slave) { SLAVE_TLB_INFO(slave).load = 0; SLAVE_TLB_INFO(slave).head = TLB_NULL_INDEX; } static void __tlb_clear_slave(struct bonding *bond, struct slave *slave, int save_load) { struct tlb_client_info *tx_hash_table; u32 index; /* clear slave from tx_hashtbl */ tx_hash_table = BOND_ALB_INFO(bond).tx_hashtbl; /* skip this if we've already freed the tx hash table */ if (tx_hash_table) { index = SLAVE_TLB_INFO(slave).head; while (index != TLB_NULL_INDEX) { u32 next_index = tx_hash_table[index].next; tlb_init_table_entry(&tx_hash_table[index], save_load); index = next_index; } } tlb_init_slave(slave); } static void tlb_clear_slave(struct bonding *bond, struct slave *slave, int save_load) { spin_lock_bh(&bond->mode_lock); __tlb_clear_slave(bond, slave, save_load); spin_unlock_bh(&bond->mode_lock); } /* Must be called before starting the monitor timer */ static int tlb_initialize(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); int size = TLB_HASH_TABLE_SIZE * sizeof(struct tlb_client_info); struct tlb_client_info *new_hashtbl; int i; new_hashtbl = kzalloc(size, GFP_KERNEL); if (!new_hashtbl) return -ENOMEM; spin_lock_bh(&bond->mode_lock); bond_info->tx_hashtbl = new_hashtbl; for (i = 0; i < TLB_HASH_TABLE_SIZE; i++) tlb_init_table_entry(&bond_info->tx_hashtbl[i], 0); spin_unlock_bh(&bond->mode_lock); return 0; } /* Must be called only after all slaves have been released */ static void tlb_deinitialize(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); spin_lock_bh(&bond->mode_lock); kfree(bond_info->tx_hashtbl); bond_info->tx_hashtbl = NULL; spin_unlock_bh(&bond->mode_lock); } static long long compute_gap(struct slave *slave) { return (s64) (slave->speed << 20) - /* Convert to Megabit per sec */ (s64) (SLAVE_TLB_INFO(slave).load << 3); /* Bytes to bits */ } static struct slave *tlb_get_least_loaded_slave(struct bonding *bond) { struct slave *slave, *least_loaded; struct list_head *iter; long long max_gap; least_loaded = NULL; max_gap = LLONG_MIN; /* Find the slave with the largest gap */ bond_for_each_slave_rcu(bond, slave, iter) { if (bond_slave_can_tx(slave)) { long long gap = compute_gap(slave); if (max_gap < gap) { least_loaded = slave; max_gap = gap; } } } return least_loaded; } static struct slave *__tlb_choose_channel(struct bonding *bond, u32 hash_index, u32 skb_len) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct tlb_client_info *hash_table; struct slave *assigned_slave; hash_table = bond_info->tx_hashtbl; assigned_slave = hash_table[hash_index].tx_slave; if (!assigned_slave) { assigned_slave = tlb_get_least_loaded_slave(bond); if (assigned_slave) { struct tlb_slave_info *slave_info = &(SLAVE_TLB_INFO(assigned_slave)); u32 next_index = slave_info->head; hash_table[hash_index].tx_slave = assigned_slave; hash_table[hash_index].next = next_index; hash_table[hash_index].prev = TLB_NULL_INDEX; if (next_index != TLB_NULL_INDEX) hash_table[next_index].prev = hash_index; slave_info->head = hash_index; slave_info->load += hash_table[hash_index].load_history; } } if (assigned_slave) hash_table[hash_index].tx_bytes += skb_len; return assigned_slave; } static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u32 skb_len) { struct slave *tx_slave; /* We don't need to disable softirq here, because * tlb_choose_channel() is only called by bond_alb_xmit() * which already has softirq disabled. */ spin_lock(&bond->mode_lock); tx_slave = __tlb_choose_channel(bond, hash_index, skb_len); spin_unlock(&bond->mode_lock); return tx_slave; } /*********************** rlb specific functions ***************************/ /* when an ARP REPLY is received from a client update its info * in the rx_hashtbl */ static void rlb_update_entry_from_arp(struct bonding *bond, struct arp_pkt *arp) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct rlb_client_info *client_info; u32 hash_index; spin_lock_bh(&bond->mode_lock); hash_index = _simple_hash((u8 *)&(arp->ip_src), sizeof(arp->ip_src)); client_info = &(bond_info->rx_hashtbl[hash_index]); if ((client_info->assigned) && (client_info->ip_src == arp->ip_dst) && (client_info->ip_dst == arp->ip_src) && (!ether_addr_equal_64bits(client_info->mac_dst, arp->mac_src))) { /* update the clients MAC address */ ether_addr_copy(client_info->mac_dst, arp->mac_src); client_info->ntt = 1; bond_info->rx_ntt = 1; } spin_unlock_bh(&bond->mode_lock); } static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond, struct slave *slave) { struct arp_pkt *arp, _arp; if (skb->protocol != cpu_to_be16(ETH_P_ARP)) goto out; arp = skb_header_pointer(skb, 0, sizeof(_arp), &_arp); if (!arp) goto out; /* We received an ARP from arp->ip_src. * We might have used this IP address previously (on the bonding host * itself or on a system that is bridged together with the bond). * However, if arp->mac_src is different than what is stored in * rx_hashtbl, some other host is now using the IP and we must prevent * sending out client updates with this IP address and the old MAC * address. * Clean up all hash table entries that have this address as ip_src but * have a different mac_src. */ rlb_purge_src_ip(bond, arp); if (arp->op_code == htons(ARPOP_REPLY)) { /* update rx hash table for this ARP */ rlb_update_entry_from_arp(bond, arp); slave_dbg(bond->dev, slave->dev, "Server received an ARP Reply from client\n"); } out: return RX_HANDLER_ANOTHER; } /* Caller must hold rcu_read_lock() */ static struct slave *__rlb_next_rx_slave(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct slave *before = NULL, *rx_slave = NULL, *slave; struct list_head *iter; bool found = false; bond_for_each_slave_rcu(bond, slave, iter) { if (!bond_slave_can_tx(slave)) continue; if (!found) { if (!before || before->speed < slave->speed) before = slave; } else { if (!rx_slave || rx_slave->speed < slave->speed) rx_slave = slave; } if (slave == bond_info->rx_slave) found = true; } /* we didn't find anything after the current or we have something * better before and up to the current slave */ if (!rx_slave || (before && rx_slave->speed < before->speed)) rx_slave = before; if (rx_slave) bond_info->rx_slave = rx_slave; return rx_slave; } /* Caller must hold RTNL, rcu_read_lock is obtained only to silence checkers */ static struct slave *rlb_next_rx_slave(struct bonding *bond) { struct slave *rx_slave; ASSERT_RTNL(); rcu_read_lock(); rx_slave = __rlb_next_rx_slave(bond); rcu_read_unlock(); return rx_slave; } /* teach the switch the mac of a disabled slave * on the primary for fault tolerance * * Caller must hold RTNL */ static void rlb_teach_disabled_mac_on_primary(struct bonding *bond, const u8 addr[]) { struct slave *curr_active = rtnl_dereference(bond->curr_active_slave); if (!curr_active) return; if (!bond->alb_info.primary_is_promisc) { if (!dev_set_promiscuity(curr_active->dev, 1)) bond->alb_info.primary_is_promisc = 1; else bond->alb_info.primary_is_promisc = 0; } bond->alb_info.rlb_promisc_timeout_counter = 0; alb_send_learning_packets(curr_active, addr, true); } /* slave being removed should not be active at this point * * Caller must hold rtnl. */ static void rlb_clear_slave(struct bonding *bond, struct slave *slave) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct rlb_client_info *rx_hash_table; u32 index, next_index; /* clear slave from rx_hashtbl */ spin_lock_bh(&bond->mode_lock); rx_hash_table = bond_info->rx_hashtbl; index = bond_info->rx_hashtbl_used_head; for (; index != RLB_NULL_INDEX; index = next_index) { next_index = rx_hash_table[index].used_next; if (rx_hash_table[index].slave == slave) { struct slave *assigned_slave = rlb_next_rx_slave(bond); if (assigned_slave) { rx_hash_table[index].slave = assigned_slave; if (is_valid_ether_addr(rx_hash_table[index].mac_dst)) { bond_info->rx_hashtbl[index].ntt = 1; bond_info->rx_ntt = 1; /* A slave has been removed from the * table because it is either disabled * or being released. We must retry the * update to avoid clients from not * being updated & disconnecting when * there is stress */ bond_info->rlb_update_retry_counter = RLB_UPDATE_RETRY; } } else { /* there is no active slave */ rx_hash_table[index].slave = NULL; } } } spin_unlock_bh(&bond->mode_lock); if (slave != rtnl_dereference(bond->curr_active_slave)) rlb_teach_disabled_mac_on_primary(bond, slave->dev->dev_addr); } static void rlb_update_client(struct rlb_client_info *client_info) { int i; if (!client_info->slave || !is_valid_ether_addr(client_info->mac_dst)) return; for (i = 0; i < RLB_ARP_BURST_SIZE; i++) { struct sk_buff *skb; skb = arp_create(ARPOP_REPLY, ETH_P_ARP, client_info->ip_dst, client_info->slave->dev, client_info->ip_src, client_info->mac_dst, client_info->slave->dev->dev_addr, client_info->mac_dst); if (!skb) { slave_err(client_info->slave->bond->dev, client_info->slave->dev, "failed to create an ARP packet\n"); continue; } skb->dev = client_info->slave->dev; if (client_info->vlan_id) { __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), client_info->vlan_id); } arp_xmit(skb); } } /* sends ARP REPLIES that update the clients that need updating */ static void rlb_update_rx_clients(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct rlb_client_info *client_info; u32 hash_index; spin_lock_bh(&bond->mode_lock); hash_index = bond_info->rx_hashtbl_used_head; for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { client_info = &(bond_info->rx_hashtbl[hash_index]); if (client_info->ntt) { rlb_update_client(client_info); if (bond_info->rlb_update_retry_counter == 0) client_info->ntt = 0; } } /* do not update the entries again until this counter is zero so that * not to confuse the clients. */ bond_info->rlb_update_delay_counter = RLB_UPDATE_DELAY; spin_unlock_bh(&bond->mode_lock); } /* The slave was assigned a new mac address - update the clients */ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *slave) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct rlb_client_info *client_info; int ntt = 0; u32 hash_index; spin_lock_bh(&bond->mode_lock); hash_index = bond_info->rx_hashtbl_used_head; for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { client_info = &(bond_info->rx_hashtbl[hash_index]); if ((client_info->slave == slave) && is_valid_ether_addr(client_info->mac_dst)) { client_info->ntt = 1; ntt = 1; } } /* update the team's flag only after the whole iteration */ if (ntt) { bond_info->rx_ntt = 1; /* fasten the change */ bond_info->rlb_update_retry_counter = RLB_UPDATE_RETRY; } spin_unlock_bh(&bond->mode_lock); } /* mark all clients using src_ip to be updated */ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct rlb_client_info *client_info; u32 hash_index; spin_lock(&bond->mode_lock); hash_index = bond_info->rx_hashtbl_used_head; for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { client_info = &(bond_info->rx_hashtbl[hash_index]); if (!client_info->slave) { netdev_err(bond->dev, "found a client with no channel in the client's hash table\n"); continue; } /* update all clients using this src_ip, that are not assigned * to the team's address (curr_active_slave) and have a known * unicast mac address. */ if ((client_info->ip_src == src_ip) && !ether_addr_equal_64bits(client_info->slave->dev->dev_addr, bond->dev->dev_addr) && is_valid_ether_addr(client_info->mac_dst)) { client_info->ntt = 1; bond_info->rx_ntt = 1; } } spin_unlock(&bond->mode_lock); } static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bond, const struct arp_pkt *arp) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct slave *assigned_slave, *curr_active_slave; struct rlb_client_info *client_info; u32 hash_index = 0; spin_lock(&bond->mode_lock); curr_active_slave = rcu_dereference(bond->curr_active_slave); hash_index = _simple_hash((u8 *)&arp->ip_dst, sizeof(arp->ip_dst)); client_info = &(bond_info->rx_hashtbl[hash_index]); if (client_info->assigned) { if ((client_info->ip_src == arp->ip_src) && (client_info->ip_dst == arp->ip_dst)) { /* the entry is already assigned to this client */ if (!is_broadcast_ether_addr(arp->mac_dst)) { /* update mac address from arp */ ether_addr_copy(client_info->mac_dst, arp->mac_dst); } ether_addr_copy(client_info->mac_src, arp->mac_src); assigned_slave = client_info->slave; if (assigned_slave) { spin_unlock(&bond->mode_lock); return assigned_slave; } } else { /* the entry is already assigned to some other client, * move the old client to primary (curr_active_slave) so * that the new client can be assigned to this entry. */ if (curr_active_slave && client_info->slave != curr_active_slave) { client_info->slave = curr_active_slave; rlb_update_client(client_info); } } } /* assign a new slave */ assigned_slave = __rlb_next_rx_slave(bond); if (assigned_slave) { if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) { /* ip_src is going to be updated, * fix the src hash list */ u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src)); rlb_src_unlink(bond, hash_index); rlb_src_link(bond, hash_src, hash_index); } client_info->ip_src = arp->ip_src; client_info->ip_dst = arp->ip_dst; /* arp->mac_dst is broadcast for arp requests. * will be updated with clients actual unicast mac address * upon receiving an arp reply. */ ether_addr_copy(client_info->mac_dst, arp->mac_dst); ether_addr_copy(client_info->mac_src, arp->mac_src); client_info->slave = assigned_slave; if (is_valid_ether_addr(client_info->mac_dst)) { client_info->ntt = 1; bond->alb_info.rx_ntt = 1; } else { client_info->ntt = 0; } if (vlan_get_tag(skb, &client_info->vlan_id)) client_info->vlan_id = 0; if (!client_info->assigned) { u32 prev_tbl_head = bond_info->rx_hashtbl_used_head; bond_info->rx_hashtbl_used_head = hash_index; client_info->used_next = prev_tbl_head; if (prev_tbl_head != RLB_NULL_INDEX) { bond_info->rx_hashtbl[prev_tbl_head].used_prev = hash_index; } client_info->assigned = 1; } } spin_unlock(&bond->mode_lock); return assigned_slave; } /* chooses (and returns) transmit channel for arp reply * does not choose channel for other arp types since they are * sent on the curr_active_slave */ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond) { struct slave *tx_slave = NULL; struct net_device *dev; struct arp_pkt *arp; if (!pskb_network_may_pull(skb, sizeof(*arp))) return NULL; arp = (struct arp_pkt *)skb_network_header(skb); /* Don't modify or load balance ARPs that do not originate * from the bond itself or a VLAN directly above the bond. */ if (!bond_slave_has_mac_rcu(bond, arp->mac_src)) return NULL; dev = ip_dev_find(dev_net(bond->dev), arp->ip_src); if (dev) { if (netif_is_any_bridge_master(dev)) { dev_put(dev); return NULL; } dev_put(dev); } if (arp->op_code == htons(ARPOP_REPLY)) { /* the arp must be sent on the selected rx channel */ tx_slave = rlb_choose_channel(skb, bond, arp); if (tx_slave) bond_hw_addr_copy(arp->mac_src, tx_slave->dev->dev_addr, tx_slave->dev->addr_len); netdev_dbg(bond->dev, "(slave %s): Server sent ARP Reply packet\n", tx_slave ? tx_slave->dev->name : "NULL"); } else if (arp->op_code == htons(ARPOP_REQUEST)) { /* Create an entry in the rx_hashtbl for this client as a * place holder. * When the arp reply is received the entry will be updated * with the correct unicast address of the client. */ tx_slave = rlb_choose_channel(skb, bond, arp); /* The ARP reply packets must be delayed so that * they can cancel out the influence of the ARP request. */ bond->alb_info.rlb_update_delay_counter = RLB_UPDATE_DELAY; /* arp requests are broadcast and are sent on the primary * the arp request will collapse all clients on the subnet to * the primary slave. We must register these clients to be * updated with their assigned mac. */ rlb_req_update_subnet_clients(bond, arp->ip_src); netdev_dbg(bond->dev, "(slave %s): Server sent ARP Request packet\n", tx_slave ? tx_slave->dev->name : "NULL"); } return tx_slave; } static void rlb_rebalance(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct slave *assigned_slave; struct rlb_client_info *client_info; int ntt; u32 hash_index; spin_lock_bh(&bond->mode_lock); ntt = 0; hash_index = bond_info->rx_hashtbl_used_head; for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) { client_info = &(bond_info->rx_hashtbl[hash_index]); assigned_slave = __rlb_next_rx_slave(bond); if (assigned_slave && (client_info->slave != assigned_slave)) { client_info->slave = assigned_slave; if (!is_zero_ether_addr(client_info->mac_dst)) { client_info->ntt = 1; ntt = 1; } } } /* update the team's flag only after the whole iteration */ if (ntt) bond_info->rx_ntt = 1; spin_unlock_bh(&bond->mode_lock); } /* Caller must hold mode_lock */ static void rlb_init_table_entry_dst(struct rlb_client_info *entry) { entry->used_next = RLB_NULL_INDEX; entry->used_prev = RLB_NULL_INDEX; entry->assigned = 0; entry->slave = NULL; entry->vlan_id = 0; } static void rlb_init_table_entry_src(struct rlb_client_info *entry) { entry->src_first = RLB_NULL_INDEX; entry->src_prev = RLB_NULL_INDEX; entry->src_next = RLB_NULL_INDEX; } static void rlb_init_table_entry(struct rlb_client_info *entry) { memset(entry, 0, sizeof(struct rlb_client_info)); rlb_init_table_entry_dst(entry); rlb_init_table_entry_src(entry); } static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); u32 next_index = bond_info->rx_hashtbl[index].used_next; u32 prev_index = bond_info->rx_hashtbl[index].used_prev; if (index == bond_info->rx_hashtbl_used_head) bond_info->rx_hashtbl_used_head = next_index; if (prev_index != RLB_NULL_INDEX) bond_info->rx_hashtbl[prev_index].used_next = next_index; if (next_index != RLB_NULL_INDEX) bond_info->rx_hashtbl[next_index].used_prev = prev_index; } /* unlink a rlb hash table entry from the src list */ static void rlb_src_unlink(struct bonding *bond, u32 index) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); u32 next_index = bond_info->rx_hashtbl[index].src_next; u32 prev_index = bond_info->rx_hashtbl[index].src_prev; bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX; bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX; if (next_index != RLB_NULL_INDEX) bond_info->rx_hashtbl[next_index].src_prev = prev_index; if (prev_index == RLB_NULL_INDEX) return; /* is prev_index pointing to the head of this list? */ if (bond_info->rx_hashtbl[prev_index].src_first == index) bond_info->rx_hashtbl[prev_index].src_first = next_index; else bond_info->rx_hashtbl[prev_index].src_next = next_index; } static void rlb_delete_table_entry(struct bonding *bond, u32 index) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]); rlb_delete_table_entry_dst(bond, index); rlb_init_table_entry_dst(entry); rlb_src_unlink(bond, index); } /* add the rx_hashtbl[ip_dst_hash] entry to the list * of entries with identical ip_src_hash */ static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); u32 next; bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash; next = bond_info->rx_hashtbl[ip_src_hash].src_first; bond_info->rx_hashtbl[ip_dst_hash].src_next = next; if (next != RLB_NULL_INDEX) bond_info->rx_hashtbl[next].src_prev = ip_dst_hash; bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash; } /* deletes all rx_hashtbl entries with arp->ip_src if their mac_src does * not match arp->mac_src */ static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); u32 ip_src_hash = _simple_hash((u8 *)&(arp->ip_src), sizeof(arp->ip_src)); u32 index; spin_lock_bh(&bond->mode_lock); index = bond_info->rx_hashtbl[ip_src_hash].src_first; while (index != RLB_NULL_INDEX) { struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]); u32 next_index = entry->src_next; if (entry->ip_src == arp->ip_src && !ether_addr_equal_64bits(arp->mac_src, entry->mac_src)) rlb_delete_table_entry(bond, index); index = next_index; } spin_unlock_bh(&bond->mode_lock); } static int rlb_initialize(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct rlb_client_info *new_hashtbl; int size = RLB_HASH_TABLE_SIZE * sizeof(struct rlb_client_info); int i; new_hashtbl = kmalloc(size, GFP_KERNEL); if (!new_hashtbl) return -1; spin_lock_bh(&bond->mode_lock); bond_info->rx_hashtbl = new_hashtbl; bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX; for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) rlb_init_table_entry(bond_info->rx_hashtbl + i); spin_unlock_bh(&bond->mode_lock); /* register to receive ARPs */ bond->recv_probe = rlb_arp_recv; return 0; } static void rlb_deinitialize(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); spin_lock_bh(&bond->mode_lock); kfree(bond_info->rx_hashtbl); bond_info->rx_hashtbl = NULL; bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX; spin_unlock_bh(&bond->mode_lock); } static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); u32 curr_index; spin_lock_bh(&bond->mode_lock); curr_index = bond_info->rx_hashtbl_used_head; while (curr_index != RLB_NULL_INDEX) { struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]); u32 next_index = bond_info->rx_hashtbl[curr_index].used_next; if (curr->vlan_id == vlan_id) rlb_delete_table_entry(bond, curr_index); curr_index = next_index; } spin_unlock_bh(&bond->mode_lock); } /*********************** tlb/rlb shared functions *********************/ static void alb_send_lp_vid(struct slave *slave, const u8 mac_addr[], __be16 vlan_proto, u16 vid) { struct learning_pkt pkt; struct sk_buff *skb; int size = sizeof(struct learning_pkt); memset(&pkt, 0, size); ether_addr_copy(pkt.mac_dst, mac_addr); ether_addr_copy(pkt.mac_src, mac_addr); pkt.type = cpu_to_be16(ETH_P_LOOPBACK); skb = dev_alloc_skb(size); if (!skb) return; skb_put_data(skb, &pkt, size); skb_reset_mac_header(skb); skb->network_header = skb->mac_header + ETH_HLEN; skb->protocol = pkt.type; skb->priority = TC_PRIO_CONTROL; skb->dev = slave->dev; slave_dbg(slave->bond->dev, slave->dev, "Send learning packet: mac %pM vlan %d\n", mac_addr, vid); if (vid) __vlan_hwaccel_put_tag(skb, vlan_proto, vid); dev_queue_xmit(skb); } struct alb_walk_data { struct bonding *bond; struct slave *slave; const u8 *mac_addr; bool strict_match; }; static int alb_upper_dev_walk(struct net_device *upper, struct netdev_nested_priv *priv) { struct alb_walk_data *data = (struct alb_walk_data *)priv->data; bool strict_match = data->strict_match; const u8 *mac_addr = data->mac_addr; struct bonding *bond = data->bond; struct slave *slave = data->slave; struct bond_vlan_tag *tags; if (is_vlan_dev(upper) && bond->dev->lower_level == upper->lower_level - 1) { if (upper->addr_assign_type == NET_ADDR_STOLEN) { alb_send_lp_vid(slave, mac_addr, vlan_dev_vlan_proto(upper), vlan_dev_vlan_id(upper)); } else { alb_send_lp_vid(slave, upper->dev_addr, vlan_dev_vlan_proto(upper), vlan_dev_vlan_id(upper)); } } /* If this is a macvlan device, then only send updates * when strict_match is turned off. */ if (netif_is_macvlan(upper) && !strict_match) { tags = bond_verify_device_path(bond->dev, upper, 0); if (IS_ERR_OR_NULL(tags)) return -ENOMEM; alb_send_lp_vid(slave, upper->dev_addr, tags[0].vlan_proto, tags[0].vlan_id); kfree(tags); } return 0; } static void alb_send_learning_packets(struct slave *slave, const u8 mac_addr[], bool strict_match) { struct bonding *bond = bond_get_bond_by_slave(slave); struct netdev_nested_priv priv; struct alb_walk_data data = { .strict_match = strict_match, .mac_addr = mac_addr, .slave = slave, .bond = bond, }; priv.data = (void *)&data; /* send untagged */ alb_send_lp_vid(slave, mac_addr, 0, 0); /* loop through all devices and see if we need to send a packet * for that device. */ rcu_read_lock(); netdev_walk_all_upper_dev_rcu(bond->dev, alb_upper_dev_walk, &priv); rcu_read_unlock(); } static int alb_set_slave_mac_addr(struct slave *slave, const u8 addr[], unsigned int len) { struct net_device *dev = slave->dev; struct sockaddr_storage ss; if (BOND_MODE(slave->bond) == BOND_MODE_TLB) { __dev_addr_set(dev, addr, len); return 0; } /* for rlb each slave must have a unique hw mac addresses so that * each slave will receive packets destined to a different mac */ memcpy(ss.__data, addr, len); ss.ss_family = dev->type; if (dev_set_mac_address(dev, &ss, NULL)) { slave_err(slave->bond->dev, dev, "dev_set_mac_address on slave failed! ALB mode requires that the base driver support setting the hw address also when the network device's interface is open\n"); return -EOPNOTSUPP; } return 0; } /* Swap MAC addresses between two slaves. * * Called with RTNL held, and no other locks. */ static void alb_swap_mac_addr(struct slave *slave1, struct slave *slave2) { u8 tmp_mac_addr[MAX_ADDR_LEN]; bond_hw_addr_copy(tmp_mac_addr, slave1->dev->dev_addr, slave1->dev->addr_len); alb_set_slave_mac_addr(slave1, slave2->dev->dev_addr, slave2->dev->addr_len); alb_set_slave_mac_addr(slave2, tmp_mac_addr, slave1->dev->addr_len); } /* Send learning packets after MAC address swap. * * Called with RTNL and no other locks */ static void alb_fasten_mac_swap(struct bonding *bond, struct slave *slave1, struct slave *slave2) { int slaves_state_differ = (bond_slave_can_tx(slave1) != bond_slave_can_tx(slave2)); struct slave *disabled_slave = NULL; ASSERT_RTNL(); /* fasten the change in the switch */ if (bond_slave_can_tx(slave1)) { alb_send_learning_packets(slave1, slave1->dev->dev_addr, false); if (bond->alb_info.rlb_enabled) { /* inform the clients that the mac address * has changed */ rlb_req_update_slave_clients(bond, slave1); } } else { disabled_slave = slave1; } if (bond_slave_can_tx(slave2)) { alb_send_learning_packets(slave2, slave2->dev->dev_addr, false); if (bond->alb_info.rlb_enabled) { /* inform the clients that the mac address * has changed */ rlb_req_update_slave_clients(bond, slave2); } } else { disabled_slave = slave2; } if (bond->alb_info.rlb_enabled && slaves_state_differ) { /* A disabled slave was assigned an active mac addr */ rlb_teach_disabled_mac_on_primary(bond, disabled_slave->dev->dev_addr); } } /** * alb_change_hw_addr_on_detach * @bond: bonding we're working on * @slave: the slave that was just detached * * We assume that @slave was already detached from the slave list. * * If @slave's permanent hw address is different both from its current * address and from @bond's address, then somewhere in the bond there's * a slave that has @slave's permanet address as its current address. * We'll make sure that slave no longer uses @slave's permanent address. * * Caller must hold RTNL and no other locks */ static void alb_change_hw_addr_on_detach(struct bonding *bond, struct slave *slave) { int perm_curr_diff; int perm_bond_diff; struct slave *found_slave; perm_curr_diff = !ether_addr_equal_64bits(slave->perm_hwaddr, slave->dev->dev_addr); perm_bond_diff = !ether_addr_equal_64bits(slave->perm_hwaddr, bond->dev->dev_addr); if (perm_curr_diff && perm_bond_diff) { found_slave = bond_slave_has_mac(bond, slave->perm_hwaddr); if (found_slave) { alb_swap_mac_addr(slave, found_slave); alb_fasten_mac_swap(bond, slave, found_slave); } } } /** * alb_handle_addr_collision_on_attach * @bond: bonding we're working on * @slave: the slave that was just attached * * checks uniqueness of slave's mac address and handles the case the * new slave uses the bonds mac address. * * If the permanent hw address of @slave is @bond's hw address, we need to * find a different hw address to give @slave, that isn't in use by any other * slave in the bond. This address must be, of course, one of the permanent * addresses of the other slaves. * * We go over the slave list, and for each slave there we compare its * permanent hw address with the current address of all the other slaves. * If no match was found, then we've found a slave with a permanent address * that isn't used by any other slave in the bond, so we can assign it to * @slave. * * assumption: this function is called before @slave is attached to the * bond slave list. */ static int alb_handle_addr_collision_on_attach(struct bonding *bond, struct slave *slave) { struct slave *has_bond_addr = rcu_access_pointer(bond->curr_active_slave); struct slave *tmp_slave1, *free_mac_slave = NULL; struct list_head *iter; if (!bond_has_slaves(bond)) { /* this is the first slave */ return 0; } /* if slave's mac address differs from bond's mac address * check uniqueness of slave's mac address against the other * slaves in the bond. */ if (!ether_addr_equal_64bits(slave->perm_hwaddr, bond->dev->dev_addr)) { if (!bond_slave_has_mac(bond, slave->dev->dev_addr)) return 0; /* Try setting slave mac to bond address and fall-through * to code handling that situation below... */ alb_set_slave_mac_addr(slave, bond->dev->dev_addr, bond->dev->addr_len); } /* The slave's address is equal to the address of the bond. * Search for a spare address in the bond for this slave. */ bond_for_each_slave(bond, tmp_slave1, iter) { if (!bond_slave_has_mac(bond, tmp_slave1->perm_hwaddr)) { /* no slave has tmp_slave1's perm addr * as its curr addr */ free_mac_slave = tmp_slave1; break; } if (!has_bond_addr) { if (ether_addr_equal_64bits(tmp_slave1->dev->dev_addr, bond->dev->dev_addr)) { has_bond_addr = tmp_slave1; } } } if (free_mac_slave) { alb_set_slave_mac_addr(slave, free_mac_slave->perm_hwaddr, free_mac_slave->dev->addr_len); slave_warn(bond->dev, slave->dev, "the slave hw address is in use by the bond; giving it the hw address of %s\n", free_mac_slave->dev->name); } else if (has_bond_addr) { slave_err(bond->dev, slave->dev, "the slave hw address is in use by the bond; couldn't find a slave with a free hw address to give it (this should not have happened)\n"); return -EFAULT; } return 0; } /** * alb_set_mac_address * @bond: bonding we're working on * @addr: MAC address to set * * In TLB mode all slaves are configured to the bond's hw address, but set * their dev_addr field to different addresses (based on their permanent hw * addresses). * * For each slave, this function sets the interface to the new address and then * changes its dev_addr field to its previous value. * * Unwinding assumes bond's mac address has not yet changed. */ static int alb_set_mac_address(struct bonding *bond, void *addr) { struct slave *slave, *rollback_slave; struct list_head *iter; struct sockaddr_storage ss; char tmp_addr[MAX_ADDR_LEN]; int res; if (bond->alb_info.rlb_enabled) return 0; bond_for_each_slave(bond, slave, iter) { /* save net_device's current hw address */ bond_hw_addr_copy(tmp_addr, slave->dev->dev_addr, slave->dev->addr_len); res = dev_set_mac_address(slave->dev, addr, NULL); /* restore net_device's hw address */ dev_addr_set(slave->dev, tmp_addr); if (res) goto unwind; } return 0; unwind: memcpy(ss.__data, bond->dev->dev_addr, bond->dev->addr_len); ss.ss_family = bond->dev->type; /* unwind from head to the slave that failed */ bond_for_each_slave(bond, rollback_slave, iter) { if (rollback_slave == slave) break; bond_hw_addr_copy(tmp_addr, rollback_slave->dev->dev_addr, rollback_slave->dev->addr_len); dev_set_mac_address(rollback_slave->dev, &ss, NULL); dev_addr_set(rollback_slave->dev, tmp_addr); } return res; } /* determine if the packet is NA or NS */ static bool alb_determine_nd(struct sk_buff *skb, struct bonding *bond) { struct ipv6hdr *ip6hdr; struct icmp6hdr *hdr; if (!pskb_network_may_pull(skb, sizeof(*ip6hdr))) return true; ip6hdr = ipv6_hdr(skb); if (ip6hdr->nexthdr != IPPROTO_ICMPV6) return false; if (!pskb_network_may_pull(skb, sizeof(*ip6hdr) + sizeof(*hdr))) return true; hdr = icmp6_hdr(skb); return hdr->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT || hdr->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION; } /************************ exported alb functions ************************/ int bond_alb_initialize(struct bonding *bond, int rlb_enabled) { int res; res = tlb_initialize(bond); if (res) return res; if (rlb_enabled) { res = rlb_initialize(bond); if (res) { tlb_deinitialize(bond); return res; } bond->alb_info.rlb_enabled = 1; } else { bond->alb_info.rlb_enabled = 0; } return 0; } void bond_alb_deinitialize(struct bonding *bond) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); tlb_deinitialize(bond); if (bond_info->rlb_enabled) rlb_deinitialize(bond); } static netdev_tx_t bond_do_alb_xmit(struct sk_buff *skb, struct bonding *bond, struct slave *tx_slave) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct ethhdr *eth_data = eth_hdr(skb); if (!tx_slave) { /* unbalanced or unassigned, send through primary */ tx_slave = rcu_dereference(bond->curr_active_slave); if (bond->params.tlb_dynamic_lb) bond_info->unbalanced_load += skb->len; } if (tx_slave && bond_slave_can_tx(tx_slave)) { if (tx_slave != rcu_access_pointer(bond->curr_active_slave)) { ether_addr_copy(eth_data->h_source, tx_slave->dev->dev_addr); } return bond_dev_queue_xmit(bond, skb, tx_slave->dev); } if (tx_slave && bond->params.tlb_dynamic_lb) { spin_lock(&bond->mode_lock); __tlb_clear_slave(bond, tx_slave, 0); spin_unlock(&bond->mode_lock); } /* no suitable interface, frame not sent */ return bond_tx_drop(bond->dev, skb); } struct slave *bond_xmit_tlb_slave_get(struct bonding *bond, struct sk_buff *skb) { struct slave *tx_slave = NULL; struct ethhdr *eth_data; u32 hash_index; skb_reset_mac_header(skb); eth_data = eth_hdr(skb); /* Do not TX balance any multicast or broadcast */ if (!is_multicast_ether_addr(eth_data->h_dest)) { switch (skb->protocol) { case htons(ETH_P_IPV6): if (alb_determine_nd(skb, bond)) break; fallthrough; case htons(ETH_P_IP): hash_index = bond_xmit_hash(bond, skb); if (bond->params.tlb_dynamic_lb) { tx_slave = tlb_choose_channel(bond, hash_index & 0xFF, skb->len); } else { struct bond_up_slave *slaves; unsigned int count; slaves = rcu_dereference(bond->usable_slaves); count = slaves ? READ_ONCE(slaves->count) : 0; if (likely(count)) tx_slave = slaves->arr[hash_index % count]; } break; } } return tx_slave; } netdev_tx_t bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev) { struct bonding *bond = netdev_priv(bond_dev); struct slave *tx_slave; tx_slave = bond_xmit_tlb_slave_get(bond, skb); return bond_do_alb_xmit(skb, bond, tx_slave); } struct slave *bond_xmit_alb_slave_get(struct bonding *bond, struct sk_buff *skb) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); static const __be32 ip_bcast = htonl(0xffffffff); struct slave *tx_slave = NULL; const u8 *hash_start = NULL; bool do_tx_balance = true; struct ethhdr *eth_data; u32 hash_index = 0; int hash_size = 0; skb_reset_mac_header(skb); eth_data = eth_hdr(skb); switch (ntohs(skb->protocol)) { case ETH_P_IP: { const struct iphdr *iph; if (is_broadcast_ether_addr(eth_data->h_dest) || !pskb_network_may_pull(skb, sizeof(*iph))) { do_tx_balance = false; break; } iph = ip_hdr(skb); if (iph->daddr == ip_bcast || iph->protocol == IPPROTO_IGMP) { do_tx_balance = false; break; } hash_start = (char *)&(iph->daddr); hash_size = sizeof(iph->daddr); break; } case ETH_P_IPV6: { const struct ipv6hdr *ip6hdr; /* IPv6 doesn't really use broadcast mac address, but leave * that here just in case. */ if (is_broadcast_ether_addr(eth_data->h_dest)) { do_tx_balance = false; break; } /* IPv6 uses all-nodes multicast as an equivalent to * broadcasts in IPv4. */ if (ether_addr_equal_64bits(eth_data->h_dest, mac_v6_allmcast)) { do_tx_balance = false; break; } if (alb_determine_nd(skb, bond)) { do_tx_balance = false; break; } /* The IPv6 header is pulled by alb_determine_nd */ /* Additionally, DAD probes should not be tx-balanced as that * will lead to false positives for duplicate addresses and * prevent address configuration from working. */ ip6hdr = ipv6_hdr(skb); if (ipv6_addr_any(&ip6hdr->saddr)) { do_tx_balance = false; break; } hash_start = (char *)&ip6hdr->daddr; hash_size = sizeof(ip6hdr->daddr); break; } case ETH_P_ARP: do_tx_balance = false; if (bond_info->rlb_enabled) tx_slave = rlb_arp_xmit(skb, bond); break; default: do_tx_balance = false; break; } if (do_tx_balance) { if (bond->params.tlb_dynamic_lb) { hash_index = _simple_hash(hash_start, hash_size); tx_slave = tlb_choose_channel(bond, hash_index, skb->len); } else { /* * do_tx_balance means we are free to select the tx_slave * So we do exactly what tlb would do for hash selection */ struct bond_up_slave *slaves; unsigned int count; slaves = rcu_dereference(bond->usable_slaves); count = slaves ? READ_ONCE(slaves->count) : 0; if (likely(count)) tx_slave = slaves->arr[bond_xmit_hash(bond, skb) % count]; } } return tx_slave; } netdev_tx_t bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev) { struct bonding *bond = netdev_priv(bond_dev); struct slave *tx_slave = NULL; tx_slave = bond_xmit_alb_slave_get(bond, skb); return bond_do_alb_xmit(skb, bond, tx_slave); } void bond_alb_monitor(struct work_struct *work) { struct bonding *bond = container_of(work, struct bonding, alb_work.work); struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); struct list_head *iter; struct slave *slave; if (!bond_has_slaves(bond)) { atomic_set(&bond_info->tx_rebalance_counter, 0); bond_info->lp_counter = 0; goto re_arm; } rcu_read_lock(); atomic_inc(&bond_info->tx_rebalance_counter); bond_info->lp_counter++; /* send learning packets */ if (bond_info->lp_counter >= BOND_ALB_LP_TICKS(bond)) { bool strict_match; bond_for_each_slave_rcu(bond, slave, iter) { /* If updating current_active, use all currently * user mac addresses (!strict_match). Otherwise, only * use mac of the slave device. * In RLB mode, we always use strict matches. */ strict_match = (slave != rcu_access_pointer(bond->curr_active_slave) || bond_info->rlb_enabled); alb_send_learning_packets(slave, slave->dev->dev_addr, strict_match); } bond_info->lp_counter = 0; } /* rebalance tx traffic */ if (atomic_read(&bond_info->tx_rebalance_counter) >= BOND_TLB_REBALANCE_TICKS) { bond_for_each_slave_rcu(bond, slave, iter) { tlb_clear_slave(bond, slave, 1); if (slave == rcu_access_pointer(bond->curr_active_slave)) { SLAVE_TLB_INFO(slave).load = bond_info->unbalanced_load / BOND_TLB_REBALANCE_INTERVAL; bond_info->unbalanced_load = 0; } } atomic_set(&bond_info->tx_rebalance_counter, 0); } if (bond_info->rlb_enabled) { if (bond_info->primary_is_promisc && (++bond_info->rlb_promisc_timeout_counter >= RLB_PROMISC_TIMEOUT)) { /* dev_set_promiscuity requires rtnl and * nothing else. Avoid race with bond_close. */ rcu_read_unlock(); if (!rtnl_trylock()) goto re_arm; bond_info->rlb_promisc_timeout_counter = 0; /* If the primary was set to promiscuous mode * because a slave was disabled then * it can now leave promiscuous mode. */ dev_set_promiscuity(rtnl_dereference(bond->curr_active_slave)->dev, -1); bond_info->primary_is_promisc = 0; rtnl_unlock(); rcu_read_lock(); } if (bond_info->rlb_rebalance) { bond_info->rlb_rebalance = 0; rlb_rebalance(bond); } /* check if clients need updating */ if (bond_info->rx_ntt) { if (bond_info->rlb_update_delay_counter) { --bond_info->rlb_update_delay_counter; } else { rlb_update_rx_clients(bond); if (bond_info->rlb_update_retry_counter) --bond_info->rlb_update_retry_counter; else bond_info->rx_ntt = 0; } } } rcu_read_unlock(); re_arm: queue_delayed_work(bond->wq, &bond->alb_work, alb_delta_in_ticks); } /* assumption: called before the slave is attached to the bond * and not locked by the bond lock */ int bond_alb_init_slave(struct bonding *bond, struct slave *slave) { int res; res = alb_set_slave_mac_addr(slave, slave->perm_hwaddr, slave->dev->addr_len); if (res) return res; res = alb_handle_addr_collision_on_attach(bond, slave); if (res) return res; tlb_init_slave(slave); /* order a rebalance ASAP */ atomic_set(&bond->alb_info.tx_rebalance_counter, BOND_TLB_REBALANCE_TICKS); if (bond->alb_info.rlb_enabled) bond->alb_info.rlb_rebalance = 1; return 0; } /* Remove slave from tlb and rlb hash tables, and fix up MAC addresses * if necessary. * * Caller must hold RTNL and no other locks */ void bond_alb_deinit_slave(struct bonding *bond, struct slave *slave) { if (bond_has_slaves(bond)) alb_change_hw_addr_on_detach(bond, slave); tlb_clear_slave(bond, slave, 0); if (bond->alb_info.rlb_enabled) { bond->alb_info.rx_slave = NULL; rlb_clear_slave(bond, slave); } } void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char link) { struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond)); if (link == BOND_LINK_DOWN) { tlb_clear_slave(bond, slave, 0); if (bond->alb_info.rlb_enabled) rlb_clear_slave(bond, slave); } else if (link == BOND_LINK_UP) { /* order a rebalance ASAP */ atomic_set(&bond_info->tx_rebalance_counter, BOND_TLB_REBALANCE_TICKS); if (bond->alb_info.rlb_enabled) { bond->alb_info.rlb_rebalance = 1; /* If the updelay module parameter is smaller than the * forwarding delay of the switch the rebalance will * not work because the rebalance arp replies will * not be forwarded to the clients.. */ } } if (bond_is_nondyn_tlb(bond)) { if (bond_update_slave_arr(bond, NULL)) pr_err("Failed to build slave-array for TLB mode.\n"); } } /** * bond_alb_handle_active_change - assign new curr_active_slave * @bond: our bonding struct * @new_slave: new slave to assign * * Set the bond->curr_active_slave to @new_slave and handle * mac address swapping and promiscuity changes as needed. * * Caller must hold RTNL */ void bond_alb_handle_active_change(struct bonding *bond, struct slave *new_slave) { struct slave *swap_slave; struct slave *curr_active; curr_active = rtnl_dereference(bond->curr_active_slave); if (curr_active == new_slave) return; if (curr_active && bond->alb_info.primary_is_promisc) { dev_set_promiscuity(curr_active->dev, -1); bond->alb_info.primary_is_promisc = 0; bond->alb_info.rlb_promisc_timeout_counter = 0; } swap_slave = curr_active; rcu_assign_pointer(bond->curr_active_slave, new_slave); if (!new_slave || !bond_has_slaves(bond)) return; /* set the new curr_active_slave to the bonds mac address * i.e. swap mac addresses of old curr_active_slave and new curr_active_slave */ if (!swap_slave) swap_slave = bond_slave_has_mac(bond, bond->dev->dev_addr); /* Arrange for swap_slave and new_slave to temporarily be * ignored so we can mess with their MAC addresses without * fear of interference from transmit activity. */ if (swap_slave) tlb_clear_slave(bond, swap_slave, 1); tlb_clear_slave(bond, new_slave, 1); /* in TLB mode, the slave might flip down/up with the old dev_addr, * and thus filter bond->dev_addr's packets, so force bond's mac */ if (BOND_MODE(bond) == BOND_MODE_TLB) { struct sockaddr_storage ss; u8 tmp_addr[MAX_ADDR_LEN]; bond_hw_addr_copy(tmp_addr, new_slave->dev->dev_addr, new_slave->dev->addr_len); bond_hw_addr_copy(ss.__data, bond->dev->dev_addr, bond->dev->addr_len); ss.ss_family = bond->dev->type; /* we don't care if it can't change its mac, best effort */ dev_set_mac_address(new_slave->dev, &ss, NULL); dev_addr_set(new_slave->dev, tmp_addr); } /* curr_active_slave must be set before calling alb_swap_mac_addr */ if (swap_slave) { /* swap mac address */ alb_swap_mac_addr(swap_slave, new_slave); alb_fasten_mac_swap(bond, swap_slave, new_slave); } else { /* set the new_slave to the bond mac address */ alb_set_slave_mac_addr(new_slave, bond->dev->dev_addr, bond->dev->addr_len); alb_send_learning_packets(new_slave, bond->dev->dev_addr, false); } } /* Called with RTNL */ int bond_alb_set_mac_address(struct net_device *bond_dev, void *addr) { struct bonding *bond = netdev_priv(bond_dev); struct sockaddr_storage *ss = addr; struct slave *curr_active; struct slave *swap_slave; int res; if (!is_valid_ether_addr(ss->__data)) return -EADDRNOTAVAIL; res = alb_set_mac_address(bond, addr); if (res) return res; dev_addr_set(bond_dev, ss->__data); /* If there is no curr_active_slave there is nothing else to do. * Otherwise we'll need to pass the new address to it and handle * duplications. */ curr_active = rtnl_dereference(bond->curr_active_slave); if (!curr_active) return 0; swap_slave = bond_slave_has_mac(bond, bond_dev->dev_addr); if (swap_slave) { alb_swap_mac_addr(swap_slave, curr_active); alb_fasten_mac_swap(bond, swap_slave, curr_active); } else { alb_set_slave_mac_addr(curr_active, bond_dev->dev_addr, bond_dev->addr_len); alb_send_learning_packets(curr_active, bond_dev->dev_addr, false); if (bond->alb_info.rlb_enabled) { /* inform clients mac address has changed */ rlb_req_update_slave_clients(bond, curr_active); } } return 0; } void bond_alb_clear_vlan(struct bonding *bond, unsigned short vlan_id) { if (bond->alb_info.rlb_enabled) rlb_clear_vlan(bond, vlan_id); } |
| 5409 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * INET An implementation of the TCP/IP protocol suite for the LINUX * operating system. INET is implemented using the BSD Socket * interface as the means of communication with the user level. * * Global definitions for the Ethernet IEEE 802.3 interface. * * Version: @(#)if_ether.h 1.0.1a 02/08/94 * * Author: Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG> * Donald Becker, <becker@super.org> * Alan Cox, <alan@lxorguk.ukuu.org.uk> * Steve Whitehouse, <gw7rrm@eeshack3.swan.ac.uk> */ #ifndef _LINUX_IF_ETHER_H #define _LINUX_IF_ETHER_H #include <linux/skbuff.h> #include <uapi/linux/if_ether.h> /* XX:XX:XX:XX:XX:XX */ #define MAC_ADDR_STR_LEN (3 * ETH_ALEN - 1) static inline struct ethhdr *eth_hdr(const struct sk_buff *skb) { return (struct ethhdr *)skb_mac_header(skb); } /* Prefer this version in TX path, instead of * skb_reset_mac_header() + eth_hdr() */ static inline struct ethhdr *skb_eth_hdr(const struct sk_buff *skb) { return (struct ethhdr *)skb->data; } static inline struct ethhdr *inner_eth_hdr(const struct sk_buff *skb) { return (struct ethhdr *)skb_inner_mac_header(skb); } int eth_header_parse(const struct sk_buff *skb, unsigned char *haddr); extern ssize_t sysfs_format_mac(char *buf, const unsigned char *addr, int len); #endif /* _LINUX_IF_ETHER_H */ |
| 5271 3714 6 1688 1385 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 | /* SPDX-License-Identifier: GPL-2.0+ */ #ifndef _LINUX_MAPLE_TREE_H #define _LINUX_MAPLE_TREE_H /* * Maple Tree - An RCU-safe adaptive tree for storing ranges * Copyright (c) 2018-2022 Oracle * Authors: Liam R. Howlett <Liam.Howlett@Oracle.com> * Matthew Wilcox <willy@infradead.org> */ #include <linux/kernel.h> #include <linux/rcupdate.h> #include <linux/spinlock.h> /* #define CONFIG_MAPLE_RCU_DISABLED */ /* * Allocated nodes are mutable until they have been inserted into the tree, * at which time they cannot change their type until they have been removed * from the tree and an RCU grace period has passed. * * Removed nodes have their ->parent set to point to themselves. RCU readers * check ->parent before relying on the value that they loaded from the * slots array. This lets us reuse the slots array for the RCU head. * * Nodes in the tree point to their parent unless bit 0 is set. */ #if defined(CONFIG_64BIT) || defined(BUILD_VDSO32_64) /* 64bit sizes */ #define MAPLE_NODE_SLOTS 31 /* 256 bytes including ->parent */ #define MAPLE_RANGE64_SLOTS 16 /* 256 bytes */ #define MAPLE_ARANGE64_SLOTS 10 /* 240 bytes */ #define MAPLE_ALLOC_SLOTS (MAPLE_NODE_SLOTS - 1) #else /* 32bit sizes */ #define MAPLE_NODE_SLOTS 63 /* 256 bytes including ->parent */ #define MAPLE_RANGE64_SLOTS 32 /* 256 bytes */ #define MAPLE_ARANGE64_SLOTS 21 /* 240 bytes */ #define MAPLE_ALLOC_SLOTS (MAPLE_NODE_SLOTS - 2) #endif /* defined(CONFIG_64BIT) || defined(BUILD_VDSO32_64) */ #define MAPLE_NODE_MASK 255UL /* * The node->parent of the root node has bit 0 set and the rest of the pointer * is a pointer to the tree itself. No more bits are available in this pointer * (on m68k, the data structure may only be 2-byte aligned). * * Internal non-root nodes can only have maple_range_* nodes as parents. The * parent pointer is 256B aligned like all other tree nodes. When storing a 32 * or 64 bit values, the offset can fit into 4 bits. The 16 bit values need an * extra bit to store the offset. This extra bit comes from a reuse of the last * bit in the node type. This is possible by using bit 1 to indicate if bit 2 * is part of the type or the slot. * * Once the type is decided, the decision of an allocation range type or a * range type is done by examining the immutable tree flag for the * MT_FLAGS_ALLOC_RANGE flag. * * Node types: * 0x??1 = Root * 0x?00 = 16 bit nodes * 0x010 = 32 bit nodes * 0x110 = 64 bit nodes * * Slot size and location in the parent pointer: * type : slot location * 0x??1 : Root * 0x?00 : 16 bit values, type in 0-1, slot in 2-6 * 0x010 : 32 bit values, type in 0-2, slot in 3-6 * 0x110 : 64 bit values, type in 0-2, slot in 3-6 */ /* * This metadata is used to optimize the gap updating code and in reverse * searching for gaps or any other code that needs to find the end of the data. */ struct maple_metadata { unsigned char end; /* end of data */ unsigned char gap; /* offset of largest gap */ }; /* * Leaf nodes do not store pointers to nodes, they store user data. Users may * store almost any bit pattern. As noted above, the optimisation of storing an * entry at 0 in the root pointer cannot be done for data which have the bottom * two bits set to '10'. We also reserve values with the bottom two bits set to * '10' which are below 4096 (ie 2, 6, 10 .. 4094) for internal use. Some APIs * return errnos as a negative errno shifted right by two bits and the bottom * two bits set to '10', and while choosing to store these values in the array * is not an error, it may lead to confusion if you're testing for an error with * mas_is_err(). * * Non-leaf nodes store the type of the node pointed to (enum maple_type in bits * 3-6), bit 2 is reserved. That leaves bits 0-1 unused for now. * * In regular B-Tree terms, pivots are called keys. The term pivot is used to * indicate that the tree is specifying ranges, Pivots may appear in the * subtree with an entry attached to the value whereas keys are unique to a * specific position of a B-tree. Pivot values are inclusive of the slot with * the same index. */ struct maple_range_64 { struct maple_pnode *parent; unsigned long pivot[MAPLE_RANGE64_SLOTS - 1]; union { void __rcu *slot[MAPLE_RANGE64_SLOTS]; struct { void __rcu *pad[MAPLE_RANGE64_SLOTS - 1]; struct maple_metadata meta; }; }; }; /* * At tree creation time, the user can specify that they're willing to trade off * storing fewer entries in a tree in return for storing more information in * each node. * * The maple tree supports recording the largest range of NULL entries available * in this node, also called gaps. This optimises the tree for allocating a * range. */ struct maple_arange_64 { struct maple_pnode *parent; unsigned long pivot[MAPLE_ARANGE64_SLOTS - 1]; void __rcu *slot[MAPLE_ARANGE64_SLOTS]; unsigned long gap[MAPLE_ARANGE64_SLOTS]; struct maple_metadata meta; }; struct maple_alloc { unsigned long total; unsigned char node_count; unsigned int request_count; struct maple_alloc *slot[MAPLE_ALLOC_SLOTS]; }; struct maple_topiary { struct maple_pnode *parent; struct maple_enode *next; /* Overlaps the pivot */ }; enum maple_type { maple_dense, maple_leaf_64, maple_range_64, maple_arange_64, }; enum store_type { wr_invalid, wr_new_root, wr_store_root, wr_exact_fit, wr_spanning_store, wr_split_store, wr_rebalance, wr_append, wr_node_store, wr_slot_store, }; /** * DOC: Maple tree flags * * * MT_FLAGS_ALLOC_RANGE - Track gaps in this tree * * MT_FLAGS_USE_RCU - Operate in RCU mode * * MT_FLAGS_HEIGHT_OFFSET - The position of the tree height in the flags * * MT_FLAGS_HEIGHT_MASK - The mask for the maple tree height value * * MT_FLAGS_LOCK_MASK - How the mt_lock is used * * MT_FLAGS_LOCK_IRQ - Acquired irq-safe * * MT_FLAGS_LOCK_BH - Acquired bh-safe * * MT_FLAGS_LOCK_EXTERN - mt_lock is not used * * MAPLE_HEIGHT_MAX The largest height that can be stored */ #define MT_FLAGS_ALLOC_RANGE 0x01 #define MT_FLAGS_USE_RCU 0x02 #define MT_FLAGS_HEIGHT_OFFSET 0x02 #define MT_FLAGS_HEIGHT_MASK 0x7C #define MT_FLAGS_LOCK_MASK 0x300 #define MT_FLAGS_LOCK_IRQ 0x100 #define MT_FLAGS_LOCK_BH 0x200 #define MT_FLAGS_LOCK_EXTERN 0x300 #define MT_FLAGS_ALLOC_WRAPPED 0x0800 #define MAPLE_HEIGHT_MAX 31 #define MAPLE_NODE_TYPE_MASK 0x0F #define MAPLE_NODE_TYPE_SHIFT 0x03 #define MAPLE_RESERVED_RANGE 4096 #ifdef CONFIG_LOCKDEP typedef struct lockdep_map *lockdep_map_p; #define mt_lock_is_held(mt) \ (!(mt)->ma_external_lock || lock_is_held((mt)->ma_external_lock)) #define mt_write_lock_is_held(mt) \ (!(mt)->ma_external_lock || \ lock_is_held_type((mt)->ma_external_lock, 0)) #define mt_set_external_lock(mt, lock) \ (mt)->ma_external_lock = &(lock)->dep_map #define mt_on_stack(mt) (mt).ma_external_lock = NULL #else typedef struct { /* nothing */ } lockdep_map_p; #define mt_lock_is_held(mt) 1 #define mt_write_lock_is_held(mt) 1 #define mt_set_external_lock(mt, lock) do { } while (0) #define mt_on_stack(mt) do { } while (0) #endif /* * If the tree contains a single entry at index 0, it is usually stored in * tree->ma_root. To optimise for the page cache, an entry which ends in '00', * '01' or '11' is stored in the root, but an entry which ends in '10' will be * stored in a node. Bits 3-6 are used to store enum maple_type. * * The flags are used both to store some immutable information about this tree * (set at tree creation time) and dynamic information set under the spinlock. * * Another use of flags are to indicate global states of the tree. This is the * case with the MT_FLAGS_USE_RCU flag, which indicates the tree is currently in * RCU mode. This mode was added to allow the tree to reuse nodes instead of * re-allocating and RCU freeing nodes when there is a single user. */ struct maple_tree { union { spinlock_t ma_lock; lockdep_map_p ma_external_lock; }; unsigned int ma_flags; void __rcu *ma_root; }; /** * MTREE_INIT() - Initialize a maple tree * @name: The maple tree name * @__flags: The maple tree flags * */ #define MTREE_INIT(name, __flags) { \ .ma_lock = __SPIN_LOCK_UNLOCKED((name).ma_lock), \ .ma_flags = __flags, \ .ma_root = NULL, \ } /** * MTREE_INIT_EXT() - Initialize a maple tree with an external lock. * @name: The tree name * @__flags: The maple tree flags * @__lock: The external lock */ #ifdef CONFIG_LOCKDEP #define MTREE_INIT_EXT(name, __flags, __lock) { \ .ma_external_lock = &(__lock).dep_map, \ .ma_flags = (__flags), \ .ma_root = NULL, \ } #else #define MTREE_INIT_EXT(name, __flags, __lock) MTREE_INIT(name, __flags) #endif #define DEFINE_MTREE(name) \ struct maple_tree name = MTREE_INIT(name, 0) #define mtree_lock(mt) spin_lock((&(mt)->ma_lock)) #define mtree_lock_nested(mas, subclass) \ spin_lock_nested((&(mt)->ma_lock), subclass) #define mtree_unlock(mt) spin_unlock((&(mt)->ma_lock)) /* * The Maple Tree squeezes various bits in at various points which aren't * necessarily obvious. Usually, this is done by observing that pointers are * N-byte aligned and thus the bottom log_2(N) bits are available for use. We * don't use the high bits of pointers to store additional information because * we don't know what bits are unused on any given architecture. * * Nodes are 256 bytes in size and are also aligned to 256 bytes, giving us 8 * low bits for our own purposes. Nodes are currently of 4 types: * 1. Single pointer (Range is 0-0) * 2. Non-leaf Allocation Range nodes * 3. Non-leaf Range nodes * 4. Leaf Range nodes All nodes consist of a number of node slots, * pivots, and a parent pointer. */ struct maple_node { union { struct { struct maple_pnode *parent; void __rcu *slot[MAPLE_NODE_SLOTS]; }; struct { void *pad; struct rcu_head rcu; struct maple_enode *piv_parent; unsigned char parent_slot; enum maple_type type; unsigned char slot_len; unsigned int ma_flags; }; struct maple_range_64 mr64; struct maple_arange_64 ma64; struct maple_alloc alloc; }; }; /* * More complicated stores can cause two nodes to become one or three and * potentially alter the height of the tree. Either half of the tree may need * to be rebalanced against the other. The ma_topiary struct is used to track * which nodes have been 'cut' from the tree so that the change can be done * safely at a later date. This is done to support RCU. */ struct ma_topiary { struct maple_enode *head; struct maple_enode *tail; struct maple_tree *mtree; }; void *mtree_load(struct maple_tree *mt, unsigned long index); int mtree_insert(struct maple_tree *mt, unsigned long index, void *entry, gfp_t gfp); int mtree_insert_range(struct maple_tree *mt, unsigned long first, unsigned long last, void *entry, gfp_t gfp); int mtree_alloc_range(struct maple_tree *mt, unsigned long *startp, void *entry, unsigned long size, unsigned long min, unsigned long max, gfp_t gfp); int mtree_alloc_cyclic(struct maple_tree *mt, unsigned long *startp, void *entry, unsigned long range_lo, unsigned long range_hi, unsigned long *next, gfp_t gfp); int mtree_alloc_rrange(struct maple_tree *mt, unsigned long *startp, void *entry, unsigned long size, unsigned long min, unsigned long max, gfp_t gfp); int mtree_store_range(struct maple_tree *mt, unsigned long first, unsigned long last, void *entry, gfp_t gfp); int mtree_store(struct maple_tree *mt, unsigned long index, void *entry, gfp_t gfp); void *mtree_erase(struct maple_tree *mt, unsigned long index); int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp); int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp); void mtree_destroy(struct maple_tree *mt); void __mt_destroy(struct maple_tree *mt); /** * mtree_empty() - Determine if a tree has any present entries. * @mt: Maple Tree. * * Context: Any context. * Return: %true if the tree contains only NULL pointers. */ static inline bool mtree_empty(const struct maple_tree *mt) { return mt->ma_root == NULL; } /* Advanced API */ /* * Maple State Status * ma_active means the maple state is pointing to a node and offset and can * continue operating on the tree. * ma_start means we have not searched the tree. * ma_root means we have searched the tree and the entry we found lives in * the root of the tree (ie it has index 0, length 1 and is the only entry in * the tree). * ma_none means we have searched the tree and there is no node in the * tree for this entry. For example, we searched for index 1 in an empty * tree. Or we have a tree which points to a full leaf node and we * searched for an entry which is larger than can be contained in that * leaf node. * ma_pause means the data within the maple state may be stale, restart the * operation * ma_overflow means the search has reached the upper limit of the search * ma_underflow means the search has reached the lower limit of the search * ma_error means there was an error, check the node for the error number. */ enum maple_status { ma_active, ma_start, ma_root, ma_none, ma_pause, ma_overflow, ma_underflow, ma_error, }; /* * The maple state is defined in the struct ma_state and is used to keep track * of information during operations, and even between operations when using the * advanced API. * * If state->node has bit 0 set then it references a tree location which is not * a node (eg the root). If bit 1 is set, the rest of the bits are a negative * errno. Bit 2 (the 'unallocated slots' bit) is clear. Bits 3-6 indicate the * node type. * * state->alloc either has a request number of nodes or an allocated node. If * stat->alloc has a requested number of nodes, the first bit will be set (0x1) * and the remaining bits are the value. If state->alloc is a node, then the * node will be of type maple_alloc. maple_alloc has MAPLE_NODE_SLOTS - 1 for * storing more allocated nodes, a total number of nodes allocated, and the * node_count in this node. node_count is the number of allocated nodes in this * node. The scaling beyond MAPLE_NODE_SLOTS - 1 is handled by storing further * nodes into state->alloc->slot[0]'s node. Nodes are taken from state->alloc * by removing a node from the state->alloc node until state->alloc->node_count * is 1, when state->alloc is returned and the state->alloc->slot[0] is promoted * to state->alloc. Nodes are pushed onto state->alloc by putting the current * state->alloc into the pushed node's slot[0]. * * The state also contains the implied min/max of the state->node, the depth of * this search, and the offset. The implied min/max are either from the parent * node or are 0-oo for the root node. The depth is incremented or decremented * every time a node is walked down or up. The offset is the slot/pivot of * interest in the node - either for reading or writing. * * When returning a value the maple state index and last respectively contain * the start and end of the range for the entry. Ranges are inclusive in the * Maple Tree. * * The status of the state is used to determine how the next action should treat * the state. For instance, if the status is ma_start then the next action * should start at the root of the tree and walk down. If the status is * ma_pause then the node may be stale data and should be discarded. If the * status is ma_overflow, then the last action hit the upper limit. * */ struct ma_state { struct maple_tree *tree; /* The tree we're operating in */ unsigned long index; /* The index we're operating on - range start */ unsigned long last; /* The last index we're operating on - range end */ struct maple_enode *node; /* The node containing this entry */ unsigned long min; /* The minimum index of this node - implied pivot min */ unsigned long max; /* The maximum index of this node - implied pivot max */ struct maple_alloc *alloc; /* Allocated nodes for this operation */ enum maple_status status; /* The status of the state (active, start, none, etc) */ unsigned char depth; /* depth of tree descent during write */ unsigned char offset; unsigned char mas_flags; unsigned char end; /* The end of the node */ enum store_type store_type; /* The type of store needed for this operation */ }; struct ma_wr_state { struct ma_state *mas; struct maple_node *node; /* Decoded mas->node */ unsigned long r_min; /* range min */ unsigned long r_max; /* range max */ enum maple_type type; /* mas->node type */ unsigned char offset_end; /* The offset where the write ends */ unsigned long *pivots; /* mas->node->pivots pointer */ unsigned long end_piv; /* The pivot at the offset end */ void __rcu **slots; /* mas->node->slots pointer */ void *entry; /* The entry to write */ void *content; /* The existing entry that is being overwritten */ unsigned char vacant_height; /* Height of lowest node with free space */ unsigned char sufficient_height;/* Height of lowest node with min sufficiency + 1 nodes */ }; #define mas_lock(mas) spin_lock(&((mas)->tree->ma_lock)) #define mas_lock_nested(mas, subclass) \ spin_lock_nested(&((mas)->tree->ma_lock), subclass) #define mas_unlock(mas) spin_unlock(&((mas)->tree->ma_lock)) /* * Special values for ma_state.node. * MA_ERROR represents an errno. After dropping the lock and attempting * to resolve the error, the walk would have to be restarted from the * top of the tree as the tree may have been modified. */ #define MA_ERROR(err) \ ((struct maple_enode *)(((unsigned long)err << 2) | 2UL)) #define MA_STATE(name, mt, first, end) \ struct ma_state name = { \ .tree = mt, \ .index = first, \ .last = end, \ .node = NULL, \ .status = ma_start, \ .min = 0, \ .max = ULONG_MAX, \ .alloc = NULL, \ .mas_flags = 0, \ .store_type = wr_invalid, \ } #define MA_WR_STATE(name, ma_state, wr_entry) \ struct ma_wr_state name = { \ .mas = ma_state, \ .content = NULL, \ .entry = wr_entry, \ .vacant_height = 0, \ .sufficient_height = 0 \ } #define MA_TOPIARY(name, tree) \ struct ma_topiary name = { \ .head = NULL, \ .tail = NULL, \ .mtree = tree, \ } void *mas_walk(struct ma_state *mas); void *mas_store(struct ma_state *mas, void *entry); void *mas_erase(struct ma_state *mas); int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp); void mas_store_prealloc(struct ma_state *mas, void *entry); void *mas_find(struct ma_state *mas, unsigned long max); void *mas_find_range(struct ma_state *mas, unsigned long max); void *mas_find_rev(struct ma_state *mas, unsigned long min); void *mas_find_range_rev(struct ma_state *mas, unsigned long max); int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp); int mas_alloc_cyclic(struct ma_state *mas, unsigned long *startp, void *entry, unsigned long range_lo, unsigned long range_hi, unsigned long *next, gfp_t gfp); bool mas_nomem(struct ma_state *mas, gfp_t gfp); void mas_pause(struct ma_state *mas); void maple_tree_init(void); void mas_destroy(struct ma_state *mas); int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries); void *mas_prev(struct ma_state *mas, unsigned long min); void *mas_prev_range(struct ma_state *mas, unsigned long max); void *mas_next(struct ma_state *mas, unsigned long max); void *mas_next_range(struct ma_state *mas, unsigned long max); int mas_empty_area(struct ma_state *mas, unsigned long min, unsigned long max, unsigned long size); /* * This finds an empty area from the highest address to the lowest. * AKA "Topdown" version, */ int mas_empty_area_rev(struct ma_state *mas, unsigned long min, unsigned long max, unsigned long size); static inline void mas_init(struct ma_state *mas, struct maple_tree *tree, unsigned long addr) { memset(mas, 0, sizeof(struct ma_state)); mas->tree = tree; mas->index = mas->last = addr; mas->max = ULONG_MAX; mas->status = ma_start; mas->node = NULL; } static inline bool mas_is_active(struct ma_state *mas) { return mas->status == ma_active; } static inline bool mas_is_err(struct ma_state *mas) { return mas->status == ma_error; } /** * mas_reset() - Reset a Maple Tree operation state. * @mas: Maple Tree operation state. * * Resets the error or walk state of the @mas so future walks of the * array will start from the root. Use this if you have dropped the * lock and want to reuse the ma_state. * * Context: Any context. */ static __always_inline void mas_reset(struct ma_state *mas) { mas->status = ma_start; mas->node = NULL; } /** * mas_for_each() - Iterate over a range of the maple tree. * @__mas: Maple Tree operation state (maple_state) * @__entry: Entry retrieved from the tree * @__max: maximum index to retrieve from the tree * * When returned, mas->index and mas->last will hold the entire range for the * entry. * * Note: may return the zero entry. */ #define mas_for_each(__mas, __entry, __max) \ while (((__entry) = mas_find((__mas), (__max))) != NULL) /** * mas_for_each_rev() - Iterate over a range of the maple tree in reverse order. * @__mas: Maple Tree operation state (maple_state) * @__entry: Entry retrieved from the tree * @__min: minimum index to retrieve from the tree * * When returned, mas->index and mas->last will hold the entire range for the * entry. * * Note: may return the zero entry. */ #define mas_for_each_rev(__mas, __entry, __min) \ while (((__entry) = mas_find_rev((__mas), (__min))) != NULL) #ifdef CONFIG_DEBUG_MAPLE_TREE enum mt_dump_format { mt_dump_dec, mt_dump_hex, }; extern atomic_t maple_tree_tests_run; extern atomic_t maple_tree_tests_passed; void mt_dump(const struct maple_tree *mt, enum mt_dump_format format); void mas_dump(const struct ma_state *mas); void mas_wr_dump(const struct ma_wr_state *wr_mas); void mt_validate(struct maple_tree *mt); void mt_cache_shrink(void); #define MT_BUG_ON(__tree, __x) do { \ atomic_inc(&maple_tree_tests_run); \ if (__x) { \ pr_info("BUG at %s:%d (%u)\n", \ __func__, __LINE__, __x); \ mt_dump(__tree, mt_dump_hex); \ pr_info("Pass: %u Run:%u\n", \ atomic_read(&maple_tree_tests_passed), \ atomic_read(&maple_tree_tests_run)); \ dump_stack(); \ } else { \ atomic_inc(&maple_tree_tests_passed); \ } \ } while (0) #define MAS_BUG_ON(__mas, __x) do { \ atomic_inc(&maple_tree_tests_run); \ if (__x) { \ pr_info("BUG at %s:%d (%u)\n", \ __func__, __LINE__, __x); \ mas_dump(__mas); \ mt_dump((__mas)->tree, mt_dump_hex); \ pr_info("Pass: %u Run:%u\n", \ atomic_read(&maple_tree_tests_passed), \ atomic_read(&maple_tree_tests_run)); \ dump_stack(); \ } else { \ atomic_inc(&maple_tree_tests_passed); \ } \ } while (0) #define MAS_WR_BUG_ON(__wrmas, __x) do { \ atomic_inc(&maple_tree_tests_run); \ if (__x) { \ pr_info("BUG at %s:%d (%u)\n", \ __func__, __LINE__, __x); \ mas_wr_dump(__wrmas); \ mas_dump((__wrmas)->mas); \ mt_dump((__wrmas)->mas->tree, mt_dump_hex); \ pr_info("Pass: %u Run:%u\n", \ atomic_read(&maple_tree_tests_passed), \ atomic_read(&maple_tree_tests_run)); \ dump_stack(); \ } else { \ atomic_inc(&maple_tree_tests_passed); \ } \ } while (0) #define MT_WARN_ON(__tree, __x) ({ \ int ret = !!(__x); \ atomic_inc(&maple_tree_tests_run); \ if (ret) { \ pr_info("WARN at %s:%d (%u)\n", \ __func__, __LINE__, __x); \ mt_dump(__tree, mt_dump_hex); \ pr_info("Pass: %u Run:%u\n", \ atomic_read(&maple_tree_tests_passed), \ atomic_read(&maple_tree_tests_run)); \ dump_stack(); \ } else { \ atomic_inc(&maple_tree_tests_passed); \ } \ unlikely(ret); \ }) #define MAS_WARN_ON(__mas, __x) ({ \ int ret = !!(__x); \ atomic_inc(&maple_tree_tests_run); \ if (ret) { \ pr_info("WARN at %s:%d (%u)\n", \ __func__, __LINE__, __x); \ mas_dump(__mas); \ mt_dump((__mas)->tree, mt_dump_hex); \ pr_info("Pass: %u Run:%u\n", \ atomic_read(&maple_tree_tests_passed), \ atomic_read(&maple_tree_tests_run)); \ dump_stack(); \ } else { \ atomic_inc(&maple_tree_tests_passed); \ } \ unlikely(ret); \ }) #define MAS_WR_WARN_ON(__wrmas, __x) ({ \ int ret = !!(__x); \ atomic_inc(&maple_tree_tests_run); \ if (ret) { \ pr_info("WARN at %s:%d (%u)\n", \ __func__, __LINE__, __x); \ mas_wr_dump(__wrmas); \ mas_dump((__wrmas)->mas); \ mt_dump((__wrmas)->mas->tree, mt_dump_hex); \ pr_info("Pass: %u Run:%u\n", \ atomic_read(&maple_tree_tests_passed), \ atomic_read(&maple_tree_tests_run)); \ dump_stack(); \ } else { \ atomic_inc(&maple_tree_tests_passed); \ } \ unlikely(ret); \ }) #else #define MT_BUG_ON(__tree, __x) BUG_ON(__x) #define MAS_BUG_ON(__mas, __x) BUG_ON(__x) #define MAS_WR_BUG_ON(__mas, __x) BUG_ON(__x) #define MT_WARN_ON(__tree, __x) WARN_ON(__x) #define MAS_WARN_ON(__mas, __x) WARN_ON(__x) #define MAS_WR_WARN_ON(__mas, __x) WARN_ON(__x) #endif /* CONFIG_DEBUG_MAPLE_TREE */ /** * __mas_set_range() - Set up Maple Tree operation state to a sub-range of the * current location. * @mas: Maple Tree operation state. * @start: New start of range in the Maple Tree. * @last: New end of range in the Maple Tree. * * set the internal maple state values to a sub-range. * Please use mas_set_range() if you do not know where you are in the tree. */ static inline void __mas_set_range(struct ma_state *mas, unsigned long start, unsigned long last) { /* Ensure the range starts within the current slot */ MAS_WARN_ON(mas, mas_is_active(mas) && (mas->index > start || mas->last < start)); mas->index = start; mas->last = last; } /** * mas_set_range() - Set up Maple Tree operation state for a different index. * @mas: Maple Tree operation state. * @start: New start of range in the Maple Tree. * @last: New end of range in the Maple Tree. * * Move the operation state to refer to a different range. This will * have the effect of starting a walk from the top; see mas_next() * to move to an adjacent index. */ static inline void mas_set_range(struct ma_state *mas, unsigned long start, unsigned long last) { mas_reset(mas); __mas_set_range(mas, start, last); } /** * mas_set() - Set up Maple Tree operation state for a different index. * @mas: Maple Tree operation state. * @index: New index into the Maple Tree. * * Move the operation state to refer to a different index. This will * have the effect of starting a walk from the top; see mas_next() * to move to an adjacent index. */ static inline void mas_set(struct ma_state *mas, unsigned long index) { mas_set_range(mas, index, index); } static inline bool mt_external_lock(const struct maple_tree *mt) { return (mt->ma_flags & MT_FLAGS_LOCK_MASK) == MT_FLAGS_LOCK_EXTERN; } /** * mt_init_flags() - Initialise an empty maple tree with flags. * @mt: Maple Tree * @flags: maple tree flags. * * If you need to initialise a Maple Tree with special flags (eg, an * allocation tree), use this function. * * Context: Any context. */ static inline void mt_init_flags(struct maple_tree *mt, unsigned int flags) { mt->ma_flags = flags; if (!mt_external_lock(mt)) spin_lock_init(&mt->ma_lock); rcu_assign_pointer(mt->ma_root, NULL); } /** * mt_init() - Initialise an empty maple tree. * @mt: Maple Tree * * An empty Maple Tree. * * Context: Any context. */ static inline void mt_init(struct maple_tree *mt) { mt_init_flags(mt, 0); } static inline bool mt_in_rcu(struct maple_tree *mt) { #ifdef CONFIG_MAPLE_RCU_DISABLED return false; #endif return mt->ma_flags & MT_FLAGS_USE_RCU; } /** * mt_clear_in_rcu() - Switch the tree to non-RCU mode. * @mt: The Maple Tree */ static inline void mt_clear_in_rcu(struct maple_tree *mt) { if (!mt_in_rcu(mt)) return; if (mt_external_lock(mt)) { WARN_ON(!mt_lock_is_held(mt)); mt->ma_flags &= ~MT_FLAGS_USE_RCU; } else { mtree_lock(mt); mt->ma_flags &= ~MT_FLAGS_USE_RCU; mtree_unlock(mt); } } /** * mt_set_in_rcu() - Switch the tree to RCU safe mode. * @mt: The Maple Tree */ static inline void mt_set_in_rcu(struct maple_tree *mt) { if (mt_in_rcu(mt)) return; if (mt_external_lock(mt)) { WARN_ON(!mt_lock_is_held(mt)); mt->ma_flags |= MT_FLAGS_USE_RCU; } else { mtree_lock(mt); mt->ma_flags |= MT_FLAGS_USE_RCU; mtree_unlock(mt); } } static inline unsigned int mt_height(const struct maple_tree *mt) { return (mt->ma_flags & MT_FLAGS_HEIGHT_MASK) >> MT_FLAGS_HEIGHT_OFFSET; } void *mt_find(struct maple_tree *mt, unsigned long *index, unsigned long max); void *mt_find_after(struct maple_tree *mt, unsigned long *index, unsigned long max); void *mt_prev(struct maple_tree *mt, unsigned long index, unsigned long min); void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max); /** * mt_for_each - Iterate over each entry starting at index until max. * @__tree: The Maple Tree * @__entry: The current entry * @__index: The index to start the search from. Subsequently used as iterator. * @__max: The maximum limit for @index * * This iterator skips all entries, which resolve to a NULL pointer, * e.g. entries which has been reserved with XA_ZERO_ENTRY. */ #define mt_for_each(__tree, __entry, __index, __max) \ for (__entry = mt_find(__tree, &(__index), __max); \ __entry; __entry = mt_find_after(__tree, &(__index), __max)) #endif /*_LINUX_MAPLE_TREE_H */ |
| 354 98 260 354 35 22 22 59 26 18 17 99 1 5 7 59 53 3 31 11 24 11 23 59 21 32 85 4233 2988 2 3 225 1678 81 893 85 872 867 872 867 2 426 447 311 196 382 1 869 870 17 15 2 3 3 121 121 14 358 356 356 358 359 359 2 358 358 358 358 44 318 6 6 3 3 3 3 4 4 4 1 4 352 1 349 1 350 350 300 58 351 4 2 2 2 2 2 357 58 303 357 11 52 13 13 1 1 11 2 6 2 6 1 5 4 2 2 2 3 2 5 1 5 5 5 2 13 13 1 1 2 9 2 4 3 2 2 4 1 3 2 2 1 1 1 26 26 26 50 49 4 2 43 1 43 44 37 11 38 1 18 8 2 19 32 6 9 34 32 11 43 23 38 5 42 41 15 23 23 23 10 21 19 23 34 2 6 5 1 2 16 18 26 26 43 43 43 39 39 3 6 25 4 4 4 120 1 121 121 117 118 119 3 2 13 13 1 50 39 4712 4595 417 1364 1342 1340 1134 1112 1111 868 387 986 383 1396 4 486 4 485 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Neighbour Discovery for IPv6 * Linux INET6 implementation * * Authors: * Pedro Roque <roque@di.fc.ul.pt> * Mike Shaver <shaver@ingenia.com> */ /* * Changes: * * Alexey I. Froloff : RFC6106 (DNSSL) support * Pierre Ynard : export userland ND options * through netlink (RDNSS support) * Lars Fenneberg : fixed MTU setting on receipt * of an RA. * Janos Farkas : kmalloc failure checks * Alexey Kuznetsov : state machine reworked * and moved to net/core. * Pekka Savola : RFC2461 validation * YOSHIFUJI Hideaki @USAGI : Verify ND options properly */ #define pr_fmt(fmt) "ICMPv6: " fmt #include <linux/module.h> #include <linux/errno.h> #include <linux/types.h> #include <linux/socket.h> #include <linux/sockios.h> #include <linux/sched.h> #include <linux/net.h> #include <linux/in6.h> #include <linux/route.h> #include <linux/init.h> #include <linux/rcupdate.h> #include <linux/slab.h> #ifdef CONFIG_SYSCTL #include <linux/sysctl.h> #endif #include <linux/if_addr.h> #include <linux/if_ether.h> #include <linux/if_arp.h> #include <linux/ipv6.h> #include <linux/icmpv6.h> #include <linux/jhash.h> #include <net/sock.h> #include <net/snmp.h> #include <net/ipv6.h> #include <net/protocol.h> #include <net/ndisc.h> #include <net/ip6_route.h> #include <net/addrconf.h> #include <net/icmp.h> #include <net/netlink.h> #include <linux/rtnetlink.h> #include <net/flow.h> #include <net/ip6_checksum.h> #include <net/inet_common.h> #include <linux/proc_fs.h> #include <linux/netfilter.h> #include <linux/netfilter_ipv6.h> static u32 ndisc_hash(const void *pkey, const struct net_device *dev, __u32 *hash_rnd); static bool ndisc_key_eq(const struct neighbour *neigh, const void *pkey); static bool ndisc_allow_add(const struct net_device *dev, struct netlink_ext_ack *extack); static int ndisc_constructor(struct neighbour *neigh); static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb); static void ndisc_error_report(struct neighbour *neigh, struct sk_buff *skb); static int pndisc_constructor(struct pneigh_entry *n); static void pndisc_destructor(struct pneigh_entry *n); static void pndisc_redo(struct sk_buff *skb); static int ndisc_is_multicast(const void *pkey); static const struct neigh_ops ndisc_generic_ops = { .family = AF_INET6, .solicit = ndisc_solicit, .error_report = ndisc_error_report, .output = neigh_resolve_output, .connected_output = neigh_connected_output, }; static const struct neigh_ops ndisc_hh_ops = { .family = AF_INET6, .solicit = ndisc_solicit, .error_report = ndisc_error_report, .output = neigh_resolve_output, .connected_output = neigh_resolve_output, }; static const struct neigh_ops ndisc_direct_ops = { .family = AF_INET6, .output = neigh_direct_output, .connected_output = neigh_direct_output, }; struct neigh_table nd_tbl = { .family = AF_INET6, .key_len = sizeof(struct in6_addr), .protocol = cpu_to_be16(ETH_P_IPV6), .hash = ndisc_hash, .key_eq = ndisc_key_eq, .constructor = ndisc_constructor, .pconstructor = pndisc_constructor, .pdestructor = pndisc_destructor, .proxy_redo = pndisc_redo, .is_multicast = ndisc_is_multicast, .allow_add = ndisc_allow_add, .id = "ndisc_cache", .parms = { .tbl = &nd_tbl, .reachable_time = ND_REACHABLE_TIME, .data = { [NEIGH_VAR_MCAST_PROBES] = 3, [NEIGH_VAR_UCAST_PROBES] = 3, [NEIGH_VAR_RETRANS_TIME] = ND_RETRANS_TIMER, [NEIGH_VAR_BASE_REACHABLE_TIME] = ND_REACHABLE_TIME, [NEIGH_VAR_DELAY_PROBE_TIME] = 5 * HZ, [NEIGH_VAR_INTERVAL_PROBE_TIME_MS] = 5 * HZ, [NEIGH_VAR_GC_STALETIME] = 60 * HZ, [NEIGH_VAR_QUEUE_LEN_BYTES] = SK_WMEM_MAX, [NEIGH_VAR_PROXY_QLEN] = 64, [NEIGH_VAR_ANYCAST_DELAY] = 1 * HZ, [NEIGH_VAR_PROXY_DELAY] = (8 * HZ) / 10, }, }, .gc_interval = 30 * HZ, .gc_thresh1 = 128, .gc_thresh2 = 512, .gc_thresh3 = 1024, }; EXPORT_SYMBOL_GPL(nd_tbl); void __ndisc_fill_addr_option(struct sk_buff *skb, int type, const void *data, int data_len, int pad) { int space = __ndisc_opt_addr_space(data_len, pad); u8 *opt = skb_put(skb, space); opt[0] = type; opt[1] = space>>3; memset(opt + 2, 0, pad); opt += pad; space -= pad; memcpy(opt+2, data, data_len); data_len += 2; opt += data_len; space -= data_len; if (space > 0) memset(opt, 0, space); } EXPORT_SYMBOL_GPL(__ndisc_fill_addr_option); static inline void ndisc_fill_addr_option(struct sk_buff *skb, int type, const void *data, u8 icmp6_type) { __ndisc_fill_addr_option(skb, type, data, skb->dev->addr_len, ndisc_addr_option_pad(skb->dev->type)); ndisc_ops_fill_addr_option(skb->dev, skb, icmp6_type); } static inline void ndisc_fill_redirect_addr_option(struct sk_buff *skb, void *ha, const u8 *ops_data) { ndisc_fill_addr_option(skb, ND_OPT_TARGET_LL_ADDR, ha, NDISC_REDIRECT); ndisc_ops_fill_redirect_addr_option(skb->dev, skb, ops_data); } static struct nd_opt_hdr *ndisc_next_option(struct nd_opt_hdr *cur, struct nd_opt_hdr *end) { int type; if (!cur || !end || cur >= end) return NULL; type = cur->nd_opt_type; do { cur = ((void *)cur) + (cur->nd_opt_len << 3); } while (cur < end && cur->nd_opt_type != type); return cur <= end && cur->nd_opt_type == type ? cur : NULL; } static inline int ndisc_is_useropt(const struct net_device *dev, struct nd_opt_hdr *opt) { return opt->nd_opt_type == ND_OPT_PREFIX_INFO || opt->nd_opt_type == ND_OPT_RDNSS || opt->nd_opt_type == ND_OPT_DNSSL || opt->nd_opt_type == ND_OPT_6CO || opt->nd_opt_type == ND_OPT_CAPTIVE_PORTAL || opt->nd_opt_type == ND_OPT_PREF64; } static struct nd_opt_hdr *ndisc_next_useropt(const struct net_device *dev, struct nd_opt_hdr *cur, struct nd_opt_hdr *end) { if (!cur || !end || cur >= end) return NULL; do { cur = ((void *)cur) + (cur->nd_opt_len << 3); } while (cur < end && !ndisc_is_useropt(dev, cur)); return cur <= end && ndisc_is_useropt(dev, cur) ? cur : NULL; } struct ndisc_options *ndisc_parse_options(const struct net_device *dev, u8 *opt, int opt_len, struct ndisc_options *ndopts) { struct nd_opt_hdr *nd_opt = (struct nd_opt_hdr *)opt; if (!nd_opt || opt_len < 0 || !ndopts) return NULL; memset(ndopts, 0, sizeof(*ndopts)); while (opt_len) { bool unknown = false; int l; if (opt_len < sizeof(struct nd_opt_hdr)) return NULL; l = nd_opt->nd_opt_len << 3; if (opt_len < l || l == 0) return NULL; if (ndisc_ops_parse_options(dev, nd_opt, ndopts)) goto next_opt; switch (nd_opt->nd_opt_type) { case ND_OPT_SOURCE_LL_ADDR: case ND_OPT_TARGET_LL_ADDR: case ND_OPT_MTU: case ND_OPT_NONCE: case ND_OPT_REDIRECT_HDR: if (ndopts->nd_opt_array[nd_opt->nd_opt_type]) { net_dbg_ratelimited("%s: duplicated ND6 option found: type=%d\n", __func__, nd_opt->nd_opt_type); } else { ndopts->nd_opt_array[nd_opt->nd_opt_type] = nd_opt; } break; case ND_OPT_PREFIX_INFO: ndopts->nd_opts_pi_end = nd_opt; if (!ndopts->nd_opt_array[nd_opt->nd_opt_type]) ndopts->nd_opt_array[nd_opt->nd_opt_type] = nd_opt; break; #ifdef CONFIG_IPV6_ROUTE_INFO case ND_OPT_ROUTE_INFO: ndopts->nd_opts_ri_end = nd_opt; if (!ndopts->nd_opts_ri) ndopts->nd_opts_ri = nd_opt; break; #endif default: unknown = true; } if (ndisc_is_useropt(dev, nd_opt)) { ndopts->nd_useropts_end = nd_opt; if (!ndopts->nd_useropts) ndopts->nd_useropts = nd_opt; } else if (unknown) { /* * Unknown options must be silently ignored, * to accommodate future extension to the * protocol. */ net_dbg_ratelimited("%s: ignored unsupported option; type=%d, len=%d\n", __func__, nd_opt->nd_opt_type, nd_opt->nd_opt_len); } next_opt: opt_len -= l; nd_opt = ((void *)nd_opt) + l; } return ndopts; } int ndisc_mc_map(const struct in6_addr *addr, char *buf, struct net_device *dev, int dir) { switch (dev->type) { case ARPHRD_ETHER: case ARPHRD_IEEE802: /* Not sure. Check it later. --ANK */ case ARPHRD_FDDI: ipv6_eth_mc_map(addr, buf); return 0; case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; case ARPHRD_INFINIBAND: ipv6_ib_mc_map(addr, dev->broadcast, buf); return 0; case ARPHRD_IPGRE: return ipv6_ipgre_mc_map(addr, dev->broadcast, buf); default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); return 0; } } return -EINVAL; } EXPORT_SYMBOL(ndisc_mc_map); static u32 ndisc_hash(const void *pkey, const struct net_device *dev, __u32 *hash_rnd) { return ndisc_hashfn(pkey, dev, hash_rnd); } static bool ndisc_key_eq(const struct neighbour *n, const void *pkey) { return neigh_key_eq128(n, pkey); } static int ndisc_constructor(struct neighbour *neigh) { struct in6_addr *addr = (struct in6_addr *)&neigh->primary_key; struct net_device *dev = neigh->dev; struct inet6_dev *in6_dev; struct neigh_parms *parms; bool is_multicast = ipv6_addr_is_multicast(addr); in6_dev = in6_dev_get(dev); if (!in6_dev) { return -EINVAL; } parms = in6_dev->nd_parms; __neigh_parms_put(neigh->parms); neigh->parms = neigh_parms_clone(parms); neigh->type = is_multicast ? RTN_MULTICAST : RTN_UNICAST; if (!dev->header_ops) { neigh->nud_state = NUD_NOARP; neigh->ops = &ndisc_direct_ops; neigh->output = neigh_direct_output; } else { if (is_multicast) { neigh->nud_state = NUD_NOARP; ndisc_mc_map(addr, neigh->ha, dev, 1); } else if (dev->flags&(IFF_NOARP|IFF_LOOPBACK)) { neigh->nud_state = NUD_NOARP; memcpy(neigh->ha, dev->dev_addr, dev->addr_len); if (dev->flags&IFF_LOOPBACK) neigh->type = RTN_LOCAL; } else if (dev->flags&IFF_POINTOPOINT) { neigh->nud_state = NUD_NOARP; memcpy(neigh->ha, dev->broadcast, dev->addr_len); } if (dev->header_ops->cache) neigh->ops = &ndisc_hh_ops; else neigh->ops = &ndisc_generic_ops; if (neigh->nud_state&NUD_VALID) neigh->output = neigh->ops->connected_output; else neigh->output = neigh->ops->output; } in6_dev_put(in6_dev); return 0; } static int pndisc_constructor(struct pneigh_entry *n) { struct in6_addr *addr = (struct in6_addr *)&n->key; struct net_device *dev = n->dev; struct in6_addr maddr; if (!dev) return -EINVAL; addrconf_addr_solict_mult(addr, &maddr); return ipv6_dev_mc_inc(dev, &maddr); } static void pndisc_destructor(struct pneigh_entry *n) { struct in6_addr *addr = (struct in6_addr *)&n->key; struct net_device *dev = n->dev; struct in6_addr maddr; if (!dev) return; addrconf_addr_solict_mult(addr, &maddr); ipv6_dev_mc_dec(dev, &maddr); } /* called with rtnl held */ static bool ndisc_allow_add(const struct net_device *dev, struct netlink_ext_ack *extack) { struct inet6_dev *idev = __in6_dev_get(dev); if (!idev || idev->cnf.disable_ipv6) { NL_SET_ERR_MSG(extack, "IPv6 is disabled on this device"); return false; } return true; } static struct sk_buff *ndisc_alloc_skb(struct net_device *dev, int len) { int hlen = LL_RESERVED_SPACE(dev); int tlen = dev->needed_tailroom; struct sk_buff *skb; skb = alloc_skb(hlen + sizeof(struct ipv6hdr) + len + tlen, GFP_ATOMIC); if (!skb) return NULL; skb->protocol = htons(ETH_P_IPV6); skb->dev = dev; skb_reserve(skb, hlen + sizeof(struct ipv6hdr)); skb_reset_transport_header(skb); /* Manually assign socket ownership as we avoid calling * sock_alloc_send_pskb() to bypass wmem buffer limits */ rcu_read_lock(); skb_set_owner_w(skb, dev_net_rcu(dev)->ipv6.ndisc_sk); rcu_read_unlock(); return skb; } static void ip6_nd_hdr(struct sk_buff *skb, const struct in6_addr *saddr, const struct in6_addr *daddr, int hop_limit, int len) { struct ipv6hdr *hdr; struct inet6_dev *idev; unsigned tclass; rcu_read_lock(); idev = __in6_dev_get(skb->dev); tclass = idev ? READ_ONCE(idev->cnf.ndisc_tclass) : 0; rcu_read_unlock(); skb_push(skb, sizeof(*hdr)); skb_reset_network_header(skb); hdr = ipv6_hdr(skb); ip6_flow_hdr(hdr, tclass, 0); hdr->payload_len = htons(len); hdr->nexthdr = IPPROTO_ICMPV6; hdr->hop_limit = hop_limit; hdr->saddr = *saddr; hdr->daddr = *daddr; } void ndisc_send_skb(struct sk_buff *skb, const struct in6_addr *daddr, const struct in6_addr *saddr) { struct icmp6hdr *icmp6h = icmp6_hdr(skb); struct dst_entry *dst = skb_dst(skb); struct net_device *dev; struct inet6_dev *idev; struct net *net; struct sock *sk; int err; u8 type; type = icmp6h->icmp6_type; rcu_read_lock(); net = dev_net_rcu(skb->dev); sk = net->ipv6.ndisc_sk; if (!dst) { struct flowi6 fl6; int oif = skb->dev->ifindex; icmpv6_flow_init(sk, &fl6, type, saddr, daddr, oif); dst = icmp6_dst_alloc(skb->dev, &fl6); if (IS_ERR(dst)) { rcu_read_unlock(); kfree_skb(skb); return; } skb_dst_set(skb, dst); } icmp6h->icmp6_cksum = csum_ipv6_magic(saddr, daddr, skb->len, IPPROTO_ICMPV6, csum_partial(icmp6h, skb->len, 0)); ip6_nd_hdr(skb, saddr, daddr, READ_ONCE(inet6_sk(sk)->hop_limit), skb->len); dev = dst_dev(dst); idev = __in6_dev_get(dev); IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTREQUESTS); err = NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_OUT, net, sk, skb, NULL, dev, dst_output); if (!err) { ICMP6MSGOUT_INC_STATS(net, idev, type); ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTMSGS); } rcu_read_unlock(); } EXPORT_SYMBOL(ndisc_send_skb); void ndisc_send_na(struct net_device *dev, const struct in6_addr *daddr, const struct in6_addr *solicited_addr, bool router, bool solicited, bool override, bool inc_opt) { struct sk_buff *skb; struct in6_addr tmpaddr; struct inet6_ifaddr *ifp; const struct in6_addr *src_addr; struct nd_msg *msg; int optlen = 0; /* for anycast or proxy, solicited_addr != src_addr */ ifp = ipv6_get_ifaddr(dev_net(dev), solicited_addr, dev, 1); if (ifp) { src_addr = solicited_addr; if (ifp->flags & IFA_F_OPTIMISTIC) override = false; inc_opt |= READ_ONCE(ifp->idev->cnf.force_tllao); in6_ifa_put(ifp); } else { if (ipv6_dev_get_saddr(dev_net(dev), dev, daddr, inet6_sk(dev_net(dev)->ipv6.ndisc_sk)->srcprefs, &tmpaddr)) return; src_addr = &tmpaddr; } if (!dev->addr_len) inc_opt = false; if (inc_opt) optlen += ndisc_opt_addr_space(dev, NDISC_NEIGHBOUR_ADVERTISEMENT); skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen); if (!skb) return; msg = skb_put(skb, sizeof(*msg)); *msg = (struct nd_msg) { .icmph = { .icmp6_type = NDISC_NEIGHBOUR_ADVERTISEMENT, .icmp6_router = router, .icmp6_solicited = solicited, .icmp6_override = override, }, .target = *solicited_addr, }; if (inc_opt) ndisc_fill_addr_option(skb, ND_OPT_TARGET_LL_ADDR, dev->dev_addr, NDISC_NEIGHBOUR_ADVERTISEMENT); ndisc_send_skb(skb, daddr, src_addr); } static void ndisc_send_unsol_na(struct net_device *dev) { struct inet6_dev *idev; struct inet6_ifaddr *ifa; idev = in6_dev_get(dev); if (!idev) return; read_lock_bh(&idev->lock); list_for_each_entry(ifa, &idev->addr_list, if_list) { /* skip tentative addresses until dad completes */ if (ifa->flags & IFA_F_TENTATIVE && !(ifa->flags & IFA_F_OPTIMISTIC)) continue; ndisc_send_na(dev, &in6addr_linklocal_allnodes, &ifa->addr, /*router=*/ !!idev->cnf.forwarding, /*solicited=*/ false, /*override=*/ true, /*inc_opt=*/ true); } read_unlock_bh(&idev->lock); in6_dev_put(idev); } struct sk_buff *ndisc_ns_create(struct net_device *dev, const struct in6_addr *solicit, const struct in6_addr *saddr, u64 nonce) { int inc_opt = dev->addr_len; struct sk_buff *skb; struct nd_msg *msg; int optlen = 0; if (!saddr) return NULL; if (ipv6_addr_any(saddr)) inc_opt = false; if (inc_opt) optlen += ndisc_opt_addr_space(dev, NDISC_NEIGHBOUR_SOLICITATION); if (nonce != 0) optlen += 8; skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen); if (!skb) return NULL; msg = skb_put(skb, sizeof(*msg)); *msg = (struct nd_msg) { .icmph = { .icmp6_type = NDISC_NEIGHBOUR_SOLICITATION, }, .target = *solicit, }; if (inc_opt) ndisc_fill_addr_option(skb, ND_OPT_SOURCE_LL_ADDR, dev->dev_addr, NDISC_NEIGHBOUR_SOLICITATION); if (nonce != 0) { u8 *opt = skb_put(skb, 8); opt[0] = ND_OPT_NONCE; opt[1] = 8 >> 3; memcpy(opt + 2, &nonce, 6); } return skb; } EXPORT_SYMBOL(ndisc_ns_create); void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit, const struct in6_addr *daddr, const struct in6_addr *saddr, u64 nonce) { struct in6_addr addr_buf; struct sk_buff *skb; if (!saddr) { if (ipv6_get_lladdr(dev, &addr_buf, (IFA_F_TENTATIVE | IFA_F_OPTIMISTIC))) return; saddr = &addr_buf; } skb = ndisc_ns_create(dev, solicit, saddr, nonce); if (skb) ndisc_send_skb(skb, daddr, saddr); } void ndisc_send_rs(struct net_device *dev, const struct in6_addr *saddr, const struct in6_addr *daddr) { struct sk_buff *skb; struct rs_msg *msg; int send_sllao = dev->addr_len; int optlen = 0; #ifdef CONFIG_IPV6_OPTIMISTIC_DAD /* * According to section 2.2 of RFC 4429, we must not * send router solicitations with a sllao from * optimistic addresses, but we may send the solicitation * if we don't include the sllao. So here we check * if our address is optimistic, and if so, we * suppress the inclusion of the sllao. */ if (send_sllao) { struct inet6_ifaddr *ifp = ipv6_get_ifaddr(dev_net(dev), saddr, dev, 1); if (ifp) { if (ifp->flags & IFA_F_OPTIMISTIC) { send_sllao = 0; } in6_ifa_put(ifp); } else { send_sllao = 0; } } #endif if (send_sllao) optlen += ndisc_opt_addr_space(dev, NDISC_ROUTER_SOLICITATION); skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen); if (!skb) return; msg = skb_put(skb, sizeof(*msg)); *msg = (struct rs_msg) { .icmph = { .icmp6_type = NDISC_ROUTER_SOLICITATION, }, }; if (send_sllao) ndisc_fill_addr_option(skb, ND_OPT_SOURCE_LL_ADDR, dev->dev_addr, NDISC_ROUTER_SOLICITATION); ndisc_send_skb(skb, daddr, saddr); } static void ndisc_error_report(struct neighbour *neigh, struct sk_buff *skb) { /* * "The sender MUST return an ICMP * destination unreachable" */ dst_link_failure(skb); kfree_skb(skb); } /* Called with locked neigh: either read or both */ static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb) { struct in6_addr *saddr = NULL; struct in6_addr mcaddr; struct net_device *dev = neigh->dev; struct in6_addr *target = (struct in6_addr *)&neigh->primary_key; int probes = atomic_read(&neigh->probes); if (skb && ipv6_chk_addr_and_flags(dev_net(dev), &ipv6_hdr(skb)->saddr, dev, false, 1, IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) saddr = &ipv6_hdr(skb)->saddr; probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES); if (probes < 0) { if (!(READ_ONCE(neigh->nud_state) & NUD_VALID)) { net_dbg_ratelimited("%s: trying to ucast probe in NUD_INVALID: %pI6\n", __func__, target); } ndisc_send_ns(dev, target, target, saddr, 0); } else if ((probes -= NEIGH_VAR(neigh->parms, APP_PROBES)) < 0) { neigh_app_ns(neigh); } else { addrconf_addr_solict_mult(target, &mcaddr); ndisc_send_ns(dev, target, &mcaddr, saddr, 0); } } static int pndisc_is_router(const void *pkey, struct net_device *dev) { struct pneigh_entry *n; int ret = -1; n = pneigh_lookup(&nd_tbl, dev_net(dev), pkey, dev); if (n) ret = !!(READ_ONCE(n->flags) & NTF_ROUTER); return ret; } void ndisc_update(const struct net_device *dev, struct neighbour *neigh, const u8 *lladdr, u8 new, u32 flags, u8 icmp6_type, struct ndisc_options *ndopts) { neigh_update(neigh, lladdr, new, flags, 0); /* report ndisc ops about neighbour update */ ndisc_ops_update(dev, neigh, flags, icmp6_type, ndopts); } static enum skb_drop_reason ndisc_recv_ns(struct sk_buff *skb) { struct nd_msg *msg = (struct nd_msg *)skb_transport_header(skb); const struct in6_addr *saddr = &ipv6_hdr(skb)->saddr; const struct in6_addr *daddr = &ipv6_hdr(skb)->daddr; u8 *lladdr = NULL; u32 ndoptlen = skb_tail_pointer(skb) - (skb_transport_header(skb) + offsetof(struct nd_msg, opt)); struct ndisc_options ndopts; struct net_device *dev = skb->dev; struct inet6_ifaddr *ifp; struct inet6_dev *idev = NULL; struct neighbour *neigh; int dad = ipv6_addr_any(saddr); int is_router = -1; SKB_DR(reason); u64 nonce = 0; bool inc; if (skb->len < sizeof(struct nd_msg)) return SKB_DROP_REASON_PKT_TOO_SMALL; if (ipv6_addr_is_multicast(&msg->target)) { net_dbg_ratelimited("NS: multicast target address\n"); return reason; } /* * RFC2461 7.1.1: * DAD has to be destined for solicited node multicast address. */ if (dad && !ipv6_addr_is_solict_mult(daddr)) { net_dbg_ratelimited("NS: bad DAD packet (wrong destination)\n"); return reason; } if (!ndisc_parse_options(dev, msg->opt, ndoptlen, &ndopts)) return SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS; if (ndopts.nd_opts_src_lladdr) { lladdr = ndisc_opt_addr_data(ndopts.nd_opts_src_lladdr, dev); if (!lladdr) { net_dbg_ratelimited("NS: invalid link-layer address length\n"); return reason; } /* RFC2461 7.1.1: * If the IP source address is the unspecified address, * there MUST NOT be source link-layer address option * in the message. */ if (dad) { net_dbg_ratelimited("NS: bad DAD packet (link-layer address option)\n"); return reason; } } if (ndopts.nd_opts_nonce && ndopts.nd_opts_nonce->nd_opt_len == 1) memcpy(&nonce, (u8 *)(ndopts.nd_opts_nonce + 1), 6); inc = ipv6_addr_is_multicast(daddr); ifp = ipv6_get_ifaddr(dev_net(dev), &msg->target, dev, 1); if (ifp) { have_ifp: if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) { if (dad) { if (nonce != 0 && ifp->dad_nonce == nonce) { u8 *np = (u8 *)&nonce; /* Matching nonce if looped back */ net_dbg_ratelimited("%s: IPv6 DAD loopback for address %pI6c nonce %pM ignored\n", ifp->idev->dev->name, &ifp->addr, np); goto out; } /* * We are colliding with another node * who is doing DAD * so fail our DAD process */ addrconf_dad_failure(skb, ifp); return reason; } else { /* * This is not a dad solicitation. * If we are an optimistic node, * we should respond. * Otherwise, we should ignore it. */ if (!(ifp->flags & IFA_F_OPTIMISTIC)) goto out; } } idev = ifp->idev; } else { struct net *net = dev_net(dev); /* perhaps an address on the master device */ if (netif_is_l3_slave(dev)) { struct net_device *mdev; mdev = netdev_master_upper_dev_get_rcu(dev); if (mdev) { ifp = ipv6_get_ifaddr(net, &msg->target, mdev, 1); if (ifp) goto have_ifp; } } idev = in6_dev_get(dev); if (!idev) { /* XXX: count this drop? */ return reason; } if (ipv6_chk_acast_addr(net, dev, &msg->target) || (READ_ONCE(idev->cnf.forwarding) && (READ_ONCE(net->ipv6.devconf_all->proxy_ndp) || READ_ONCE(idev->cnf.proxy_ndp)) && (is_router = pndisc_is_router(&msg->target, dev)) >= 0)) { if (!(NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED) && skb->pkt_type != PACKET_HOST && inc && NEIGH_VAR(idev->nd_parms, PROXY_DELAY) != 0) { /* * for anycast or proxy, * sender should delay its response * by a random time between 0 and * MAX_ANYCAST_DELAY_TIME seconds. * (RFC2461) -- yoshfuji */ struct sk_buff *n = skb_clone(skb, GFP_ATOMIC); if (n) pneigh_enqueue(&nd_tbl, idev->nd_parms, n); goto out; } } else { SKB_DR_SET(reason, IPV6_NDISC_NS_OTHERHOST); goto out; } } if (is_router < 0) is_router = READ_ONCE(idev->cnf.forwarding); if (dad) { ndisc_send_na(dev, &in6addr_linklocal_allnodes, &msg->target, !!is_router, false, (ifp != NULL), true); goto out; } if (inc) NEIGH_CACHE_STAT_INC(&nd_tbl, rcv_probes_mcast); else NEIGH_CACHE_STAT_INC(&nd_tbl, rcv_probes_ucast); /* * update / create cache entry * for the source address */ neigh = __neigh_lookup(&nd_tbl, saddr, dev, !inc || lladdr || !dev->addr_len); if (neigh) ndisc_update(dev, neigh, lladdr, NUD_STALE, NEIGH_UPDATE_F_WEAK_OVERRIDE| NEIGH_UPDATE_F_OVERRIDE, NDISC_NEIGHBOUR_SOLICITATION, &ndopts); if (neigh || !dev->header_ops) { ndisc_send_na(dev, saddr, &msg->target, !!is_router, true, (ifp != NULL && inc), inc); if (neigh) neigh_release(neigh); reason = SKB_CONSUMED; } out: if (ifp) in6_ifa_put(ifp); else in6_dev_put(idev); return reason; } static int accept_untracked_na(struct net_device *dev, struct in6_addr *saddr) { struct inet6_dev *idev = __in6_dev_get(dev); switch (READ_ONCE(idev->cnf.accept_untracked_na)) { case 0: /* Don't accept untracked na (absent in neighbor cache) */ return 0; case 1: /* Create new entries from na if currently untracked */ return 1; case 2: /* Create new entries from untracked na only if saddr is in the * same subnet as an address configured on the interface that * received the na */ return !!ipv6_chk_prefix(saddr, dev); default: return 0; } } static enum skb_drop_reason ndisc_recv_na(struct sk_buff *skb) { struct nd_msg *msg = (struct nd_msg *)skb_transport_header(skb); struct in6_addr *saddr = &ipv6_hdr(skb)->saddr; const struct in6_addr *daddr = &ipv6_hdr(skb)->daddr; u8 *lladdr = NULL; u32 ndoptlen = skb_tail_pointer(skb) - (skb_transport_header(skb) + offsetof(struct nd_msg, opt)); struct ndisc_options ndopts; struct net_device *dev = skb->dev; struct inet6_dev *idev = __in6_dev_get(dev); struct inet6_ifaddr *ifp; struct neighbour *neigh; SKB_DR(reason); u8 new_state; if (skb->len < sizeof(struct nd_msg)) return SKB_DROP_REASON_PKT_TOO_SMALL; if (ipv6_addr_is_multicast(&msg->target)) { net_dbg_ratelimited("NA: target address is multicast\n"); return reason; } if (ipv6_addr_is_multicast(daddr) && msg->icmph.icmp6_solicited) { net_dbg_ratelimited("NA: solicited NA is multicasted\n"); return reason; } /* For some 802.11 wireless deployments (and possibly other networks), * there will be a NA proxy and unsolicitd packets are attacks * and thus should not be accepted. * drop_unsolicited_na takes precedence over accept_untracked_na */ if (!msg->icmph.icmp6_solicited && idev && READ_ONCE(idev->cnf.drop_unsolicited_na)) return reason; if (!ndisc_parse_options(dev, msg->opt, ndoptlen, &ndopts)) return SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS; if (ndopts.nd_opts_tgt_lladdr) { lladdr = ndisc_opt_addr_data(ndopts.nd_opts_tgt_lladdr, dev); if (!lladdr) { net_dbg_ratelimited("NA: invalid link-layer address length\n"); return reason; } } ifp = ipv6_get_ifaddr(dev_net(dev), &msg->target, dev, 1); if (ifp) { if (skb->pkt_type != PACKET_LOOPBACK && (ifp->flags & IFA_F_TENTATIVE)) { addrconf_dad_failure(skb, ifp); return reason; } /* What should we make now? The advertisement is invalid, but ndisc specs say nothing about it. It could be misconfiguration, or an smart proxy agent tries to help us :-) We should not print the error if NA has been received from loopback - it is just our own unsolicited advertisement. */ if (skb->pkt_type != PACKET_LOOPBACK) net_warn_ratelimited("NA: %pM advertised our address %pI6c on %s!\n", eth_hdr(skb)->h_source, &ifp->addr, ifp->idev->dev->name); in6_ifa_put(ifp); return reason; } neigh = neigh_lookup(&nd_tbl, &msg->target, dev); /* RFC 9131 updates original Neighbour Discovery RFC 4861. * NAs with Target LL Address option without a corresponding * entry in the neighbour cache can now create a STALE neighbour * cache entry on routers. * * entry accept fwding solicited behaviour * ------- ------ ------ --------- ---------------------- * present X X 0 Set state to STALE * present X X 1 Set state to REACHABLE * absent 0 X X Do nothing * absent 1 0 X Do nothing * absent 1 1 X Add a new STALE entry * * Note that we don't do a (daddr == all-routers-mcast) check. */ new_state = msg->icmph.icmp6_solicited ? NUD_REACHABLE : NUD_STALE; if (!neigh && lladdr && idev && READ_ONCE(idev->cnf.forwarding)) { if (accept_untracked_na(dev, saddr)) { neigh = neigh_create(&nd_tbl, &msg->target, dev); new_state = NUD_STALE; } } if (neigh && !IS_ERR(neigh)) { u8 old_flags = neigh->flags; struct net *net = dev_net(dev); if (READ_ONCE(neigh->nud_state) & NUD_FAILED) goto out; /* * Don't update the neighbor cache entry on a proxy NA from * ourselves because either the proxied node is off link or it * has already sent a NA to us. */ if (lladdr && !memcmp(lladdr, dev->dev_addr, dev->addr_len) && READ_ONCE(net->ipv6.devconf_all->forwarding) && READ_ONCE(net->ipv6.devconf_all->proxy_ndp) && pneigh_lookup(&nd_tbl, net, &msg->target, dev)) { /* XXX: idev->cnf.proxy_ndp */ goto out; } ndisc_update(dev, neigh, lladdr, new_state, NEIGH_UPDATE_F_WEAK_OVERRIDE| (msg->icmph.icmp6_override ? NEIGH_UPDATE_F_OVERRIDE : 0)| NEIGH_UPDATE_F_OVERRIDE_ISROUTER| (msg->icmph.icmp6_router ? NEIGH_UPDATE_F_ISROUTER : 0), NDISC_NEIGHBOUR_ADVERTISEMENT, &ndopts); if ((old_flags & ~neigh->flags) & NTF_ROUTER) { /* * Change: router to host */ rt6_clean_tohost(dev_net(dev), saddr); } reason = SKB_CONSUMED; out: neigh_release(neigh); } return reason; } static enum skb_drop_reason ndisc_recv_rs(struct sk_buff *skb) { struct rs_msg *rs_msg = (struct rs_msg *)skb_transport_header(skb); unsigned long ndoptlen = skb->len - sizeof(*rs_msg); struct neighbour *neigh; struct inet6_dev *idev; const struct in6_addr *saddr = &ipv6_hdr(skb)->saddr; struct ndisc_options ndopts; u8 *lladdr = NULL; SKB_DR(reason); if (skb->len < sizeof(*rs_msg)) return SKB_DROP_REASON_PKT_TOO_SMALL; idev = __in6_dev_get(skb->dev); if (!idev) { net_err_ratelimited("RS: can't find in6 device\n"); return reason; } /* Don't accept RS if we're not in router mode */ if (!READ_ONCE(idev->cnf.forwarding)) goto out; /* * Don't update NCE if src = ::; * this implies that the source node has no ip address assigned yet. */ if (ipv6_addr_any(saddr)) goto out; /* Parse ND options */ if (!ndisc_parse_options(skb->dev, rs_msg->opt, ndoptlen, &ndopts)) return SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS; if (ndopts.nd_opts_src_lladdr) { lladdr = ndisc_opt_addr_data(ndopts.nd_opts_src_lladdr, skb->dev); if (!lladdr) goto out; } neigh = __neigh_lookup(&nd_tbl, saddr, skb->dev, 1); if (neigh) { ndisc_update(skb->dev, neigh, lladdr, NUD_STALE, NEIGH_UPDATE_F_WEAK_OVERRIDE| NEIGH_UPDATE_F_OVERRIDE| NEIGH_UPDATE_F_OVERRIDE_ISROUTER, NDISC_ROUTER_SOLICITATION, &ndopts); neigh_release(neigh); reason = SKB_CONSUMED; } out: return reason; } static void ndisc_ra_useropt(struct sk_buff *ra, struct nd_opt_hdr *opt) { struct icmp6hdr *icmp6h = (struct icmp6hdr *)skb_transport_header(ra); struct sk_buff *skb; struct nlmsghdr *nlh; struct nduseroptmsg *ndmsg; struct net *net = dev_net(ra->dev); int err; int base_size = NLMSG_ALIGN(sizeof(struct nduseroptmsg) + (opt->nd_opt_len << 3)); size_t msg_size = base_size + nla_total_size(sizeof(struct in6_addr)); skb = nlmsg_new(msg_size, GFP_ATOMIC); if (!skb) { err = -ENOBUFS; goto errout; } nlh = nlmsg_put(skb, 0, 0, RTM_NEWNDUSEROPT, base_size, 0); if (!nlh) { goto nla_put_failure; } ndmsg = nlmsg_data(nlh); ndmsg->nduseropt_family = AF_INET6; ndmsg->nduseropt_ifindex = ra->dev->ifindex; ndmsg->nduseropt_icmp_type = icmp6h->icmp6_type; ndmsg->nduseropt_icmp_code = icmp6h->icmp6_code; ndmsg->nduseropt_opts_len = opt->nd_opt_len << 3; memcpy(ndmsg + 1, opt, opt->nd_opt_len << 3); if (nla_put_in6_addr(skb, NDUSEROPT_SRCADDR, &ipv6_hdr(ra)->saddr)) goto nla_put_failure; nlmsg_end(skb, nlh); rtnl_notify(skb, net, 0, RTNLGRP_ND_USEROPT, NULL, GFP_ATOMIC); return; nla_put_failure: nlmsg_free(skb); err = -EMSGSIZE; errout: rtnl_set_sk_err(net, RTNLGRP_ND_USEROPT, err); } static enum skb_drop_reason ndisc_router_discovery(struct sk_buff *skb) { struct ra_msg *ra_msg = (struct ra_msg *)skb_transport_header(skb); bool send_ifinfo_notify = false; struct neighbour *neigh = NULL; struct ndisc_options ndopts; struct fib6_info *rt = NULL; struct inet6_dev *in6_dev; struct fib6_table *table; u32 defrtr_usr_metric; unsigned int pref = 0; __u32 old_if_flags; struct net *net; SKB_DR(reason); int lifetime; int optlen; __u8 *opt = (__u8 *)(ra_msg + 1); optlen = (skb_tail_pointer(skb) - skb_transport_header(skb)) - sizeof(struct ra_msg); net_dbg_ratelimited("RA: %s, dev: %s\n", __func__, skb->dev->name); if (!(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) { net_dbg_ratelimited("RA: source address is not link-local\n"); return reason; } if (optlen < 0) return SKB_DROP_REASON_PKT_TOO_SMALL; #ifdef CONFIG_IPV6_NDISC_NODETYPE if (skb->ndisc_nodetype == NDISC_NODETYPE_HOST) { net_dbg_ratelimited("RA: from host or unauthorized router\n"); return reason; } #endif in6_dev = __in6_dev_get(skb->dev); if (!in6_dev) { net_err_ratelimited("RA: can't find inet6 device for %s\n", skb->dev->name); return reason; } if (!ndisc_parse_options(skb->dev, opt, optlen, &ndopts)) return SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS; if (!ipv6_accept_ra(in6_dev)) { net_dbg_ratelimited("RA: %s, did not accept ra for dev: %s\n", __func__, skb->dev->name); goto skip_linkparms; } #ifdef CONFIG_IPV6_NDISC_NODETYPE /* skip link-specific parameters from interior routers */ if (skb->ndisc_nodetype == NDISC_NODETYPE_NODEFAULT) { net_dbg_ratelimited("RA: %s, nodetype is NODEFAULT, dev: %s\n", __func__, skb->dev->name); goto skip_linkparms; } #endif if (in6_dev->if_flags & IF_RS_SENT) { /* * flag that an RA was received after an RS was sent * out on this interface. */ in6_dev->if_flags |= IF_RA_RCVD; } /* * Remember the managed/otherconf flags from most recently * received RA message (RFC 2462) -- yoshfuji */ old_if_flags = in6_dev->if_flags; in6_dev->if_flags = (in6_dev->if_flags & ~(IF_RA_MANAGED | IF_RA_OTHERCONF)) | (ra_msg->icmph.icmp6_addrconf_managed ? IF_RA_MANAGED : 0) | (ra_msg->icmph.icmp6_addrconf_other ? IF_RA_OTHERCONF : 0); if (old_if_flags != in6_dev->if_flags) send_ifinfo_notify = true; if (!READ_ONCE(in6_dev->cnf.accept_ra_defrtr)) { net_dbg_ratelimited("RA: %s, defrtr is false for dev: %s\n", __func__, skb->dev->name); goto skip_defrtr; } lifetime = ntohs(ra_msg->icmph.icmp6_rt_lifetime); if (lifetime != 0 && lifetime < READ_ONCE(in6_dev->cnf.accept_ra_min_lft)) { net_dbg_ratelimited("RA: router lifetime (%ds) is too short: %s\n", lifetime, skb->dev->name); goto skip_defrtr; } /* Do not accept RA with source-addr found on local machine unless * accept_ra_from_local is set to true. */ net = dev_net(in6_dev->dev); if (!READ_ONCE(in6_dev->cnf.accept_ra_from_local) && ipv6_chk_addr(net, &ipv6_hdr(skb)->saddr, in6_dev->dev, 0)) { net_dbg_ratelimited("RA from local address detected on dev: %s: default router ignored\n", skb->dev->name); goto skip_defrtr; } #ifdef CONFIG_IPV6_ROUTER_PREF pref = ra_msg->icmph.icmp6_router_pref; /* 10b is handled as if it were 00b (medium) */ if (pref == ICMPV6_ROUTER_PREF_INVALID || !READ_ONCE(in6_dev->cnf.accept_ra_rtr_pref)) pref = ICMPV6_ROUTER_PREF_MEDIUM; #endif /* routes added from RAs do not use nexthop objects */ rt = rt6_get_dflt_router(net, &ipv6_hdr(skb)->saddr, skb->dev); if (rt) { neigh = ip6_neigh_lookup(&rt->fib6_nh->fib_nh_gw6, rt->fib6_nh->fib_nh_dev, NULL, &ipv6_hdr(skb)->saddr); if (!neigh) { net_err_ratelimited("RA: %s got default router without neighbour\n", __func__); fib6_info_release(rt); return reason; } } /* Set default route metric as specified by user */ defrtr_usr_metric = in6_dev->cnf.ra_defrtr_metric; /* delete the route if lifetime is 0 or if metric needs change */ if (rt && (lifetime == 0 || rt->fib6_metric != defrtr_usr_metric)) { ip6_del_rt(net, rt, false); rt = NULL; } net_dbg_ratelimited("RA: rt: %p lifetime: %d, metric: %d, for dev: %s\n", rt, lifetime, defrtr_usr_metric, skb->dev->name); if (!rt && lifetime) { net_dbg_ratelimited("RA: adding default router\n"); if (neigh) neigh_release(neigh); rt = rt6_add_dflt_router(net, &ipv6_hdr(skb)->saddr, skb->dev, pref, defrtr_usr_metric, lifetime); if (!rt) { net_err_ratelimited("RA: %s failed to add default route\n", __func__); return reason; } neigh = ip6_neigh_lookup(&rt->fib6_nh->fib_nh_gw6, rt->fib6_nh->fib_nh_dev, NULL, &ipv6_hdr(skb)->saddr); if (!neigh) { net_err_ratelimited("RA: %s got default router without neighbour\n", __func__); fib6_info_release(rt); return reason; } neigh->flags |= NTF_ROUTER; } else if (rt && IPV6_EXTRACT_PREF(rt->fib6_flags) != pref) { struct nl_info nlinfo = { .nl_net = net, }; rt->fib6_flags = (rt->fib6_flags & ~RTF_PREF_MASK) | RTF_PREF(pref); inet6_rt_notify(RTM_NEWROUTE, rt, &nlinfo, NLM_F_REPLACE); } if (rt) { table = rt->fib6_table; spin_lock_bh(&table->tb6_lock); fib6_set_expires(rt, jiffies + (HZ * lifetime)); fib6_add_gc_list(rt); spin_unlock_bh(&table->tb6_lock); } if (READ_ONCE(in6_dev->cnf.accept_ra_min_hop_limit) < 256 && ra_msg->icmph.icmp6_hop_limit) { if (READ_ONCE(in6_dev->cnf.accept_ra_min_hop_limit) <= ra_msg->icmph.icmp6_hop_limit) { WRITE_ONCE(in6_dev->cnf.hop_limit, ra_msg->icmph.icmp6_hop_limit); fib6_metric_set(rt, RTAX_HOPLIMIT, ra_msg->icmph.icmp6_hop_limit); } else { net_dbg_ratelimited("RA: Got route advertisement with lower hop_limit than minimum\n"); } } skip_defrtr: /* * Update Reachable Time and Retrans Timer */ if (in6_dev->nd_parms) { unsigned long rtime = ntohl(ra_msg->retrans_timer); if (rtime && rtime/1000 < MAX_SCHEDULE_TIMEOUT/HZ) { rtime = (rtime*HZ)/1000; if (rtime < HZ/100) rtime = HZ/100; NEIGH_VAR_SET(in6_dev->nd_parms, RETRANS_TIME, rtime); in6_dev->tstamp = jiffies; send_ifinfo_notify = true; } rtime = ntohl(ra_msg->reachable_time); if (rtime && rtime/1000 < MAX_SCHEDULE_TIMEOUT/(3*HZ)) { rtime = (rtime*HZ)/1000; if (rtime < HZ/10) rtime = HZ/10; if (rtime != NEIGH_VAR(in6_dev->nd_parms, BASE_REACHABLE_TIME)) { NEIGH_VAR_SET(in6_dev->nd_parms, BASE_REACHABLE_TIME, rtime); NEIGH_VAR_SET(in6_dev->nd_parms, GC_STALETIME, 3 * rtime); in6_dev->nd_parms->reachable_time = neigh_rand_reach_time(rtime); in6_dev->tstamp = jiffies; send_ifinfo_notify = true; } } } skip_linkparms: /* * Process options. */ if (!neigh) neigh = __neigh_lookup(&nd_tbl, &ipv6_hdr(skb)->saddr, skb->dev, 1); if (neigh) { u8 *lladdr = NULL; if (ndopts.nd_opts_src_lladdr) { lladdr = ndisc_opt_addr_data(ndopts.nd_opts_src_lladdr, skb->dev); if (!lladdr) { net_dbg_ratelimited("RA: invalid link-layer address length\n"); goto out; } } ndisc_update(skb->dev, neigh, lladdr, NUD_STALE, NEIGH_UPDATE_F_WEAK_OVERRIDE| NEIGH_UPDATE_F_OVERRIDE| NEIGH_UPDATE_F_OVERRIDE_ISROUTER| NEIGH_UPDATE_F_ISROUTER, NDISC_ROUTER_ADVERTISEMENT, &ndopts); reason = SKB_CONSUMED; } if (!ipv6_accept_ra(in6_dev)) { net_dbg_ratelimited("RA: %s, accept_ra is false for dev: %s\n", __func__, skb->dev->name); goto out; } #ifdef CONFIG_IPV6_ROUTE_INFO if (!READ_ONCE(in6_dev->cnf.accept_ra_from_local) && ipv6_chk_addr(dev_net(in6_dev->dev), &ipv6_hdr(skb)->saddr, in6_dev->dev, 0)) { net_dbg_ratelimited("RA from local address detected on dev: %s: router info ignored.\n", skb->dev->name); goto skip_routeinfo; } if (READ_ONCE(in6_dev->cnf.accept_ra_rtr_pref) && ndopts.nd_opts_ri) { struct nd_opt_hdr *p; for (p = ndopts.nd_opts_ri; p; p = ndisc_next_option(p, ndopts.nd_opts_ri_end)) { struct route_info *ri = (struct route_info *)p; #ifdef CONFIG_IPV6_NDISC_NODETYPE if (skb->ndisc_nodetype == NDISC_NODETYPE_NODEFAULT && ri->prefix_len == 0) continue; #endif if (ri->prefix_len == 0 && !READ_ONCE(in6_dev->cnf.accept_ra_defrtr)) continue; if (ri->lifetime != 0 && ntohl(ri->lifetime) < READ_ONCE(in6_dev->cnf.accept_ra_min_lft)) continue; if (ri->prefix_len < READ_ONCE(in6_dev->cnf.accept_ra_rt_info_min_plen)) continue; if (ri->prefix_len > READ_ONCE(in6_dev->cnf.accept_ra_rt_info_max_plen)) continue; rt6_route_rcv(skb->dev, (u8 *)p, (p->nd_opt_len) << 3, &ipv6_hdr(skb)->saddr); } } skip_routeinfo: #endif #ifdef CONFIG_IPV6_NDISC_NODETYPE /* skip link-specific ndopts from interior routers */ if (skb->ndisc_nodetype == NDISC_NODETYPE_NODEFAULT) { net_dbg_ratelimited("RA: %s, nodetype is NODEFAULT (interior routes), dev: %s\n", __func__, skb->dev->name); goto out; } #endif if (READ_ONCE(in6_dev->cnf.accept_ra_pinfo) && ndopts.nd_opts_pi) { struct nd_opt_hdr *p; for (p = ndopts.nd_opts_pi; p; p = ndisc_next_option(p, ndopts.nd_opts_pi_end)) { addrconf_prefix_rcv(skb->dev, (u8 *)p, (p->nd_opt_len) << 3, ndopts.nd_opts_src_lladdr != NULL); } } if (ndopts.nd_opts_mtu && READ_ONCE(in6_dev->cnf.accept_ra_mtu)) { __be32 n; u32 mtu; memcpy(&n, ((u8 *)(ndopts.nd_opts_mtu+1))+2, sizeof(mtu)); mtu = ntohl(n); if (in6_dev->ra_mtu != mtu) { in6_dev->ra_mtu = mtu; send_ifinfo_notify = true; } if (mtu < IPV6_MIN_MTU || mtu > skb->dev->mtu) { net_dbg_ratelimited("RA: invalid mtu: %d\n", mtu); } else if (READ_ONCE(in6_dev->cnf.mtu6) != mtu) { WRITE_ONCE(in6_dev->cnf.mtu6, mtu); fib6_metric_set(rt, RTAX_MTU, mtu); rt6_mtu_change(skb->dev, mtu); } } if (ndopts.nd_useropts) { struct nd_opt_hdr *p; for (p = ndopts.nd_useropts; p; p = ndisc_next_useropt(skb->dev, p, ndopts.nd_useropts_end)) { ndisc_ra_useropt(skb, p); } } if (ndopts.nd_opts_tgt_lladdr || ndopts.nd_opts_rh) { net_dbg_ratelimited("RA: invalid RA options\n"); } out: /* Send a notify if RA changed managed/otherconf flags or * timer settings or ra_mtu value */ if (send_ifinfo_notify) inet6_ifinfo_notify(RTM_NEWLINK, in6_dev); fib6_info_release(rt); if (neigh) neigh_release(neigh); return reason; } static enum skb_drop_reason ndisc_redirect_rcv(struct sk_buff *skb) { struct rd_msg *msg = (struct rd_msg *)skb_transport_header(skb); u32 ndoptlen = skb_tail_pointer(skb) - (skb_transport_header(skb) + offsetof(struct rd_msg, opt)); struct ndisc_options ndopts; SKB_DR(reason); u8 *hdr; #ifdef CONFIG_IPV6_NDISC_NODETYPE switch (skb->ndisc_nodetype) { case NDISC_NODETYPE_HOST: case NDISC_NODETYPE_NODEFAULT: net_dbg_ratelimited("Redirect: from host or unauthorized router\n"); return reason; } #endif if (!(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) { net_dbg_ratelimited("Redirect: source address is not link-local\n"); return reason; } if (!ndisc_parse_options(skb->dev, msg->opt, ndoptlen, &ndopts)) return SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS; if (!ndopts.nd_opts_rh) { ip6_redirect_no_header(skb, dev_net(skb->dev), skb->dev->ifindex); return reason; } hdr = (u8 *)ndopts.nd_opts_rh; hdr += 8; if (!pskb_pull(skb, hdr - skb_transport_header(skb))) return SKB_DROP_REASON_PKT_TOO_SMALL; return icmpv6_notify(skb, NDISC_REDIRECT, 0, 0); } static void ndisc_fill_redirect_hdr_option(struct sk_buff *skb, struct sk_buff *orig_skb, int rd_len) { u8 *opt = skb_put(skb, rd_len); memset(opt, 0, 8); *(opt++) = ND_OPT_REDIRECT_HDR; *(opt++) = (rd_len >> 3); opt += 6; skb_copy_bits(orig_skb, skb_network_offset(orig_skb), opt, rd_len - 8); } void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target) { struct net_device *dev = skb->dev; struct net *net = dev_net_rcu(dev); struct sock *sk = net->ipv6.ndisc_sk; int optlen = 0; struct inet_peer *peer; struct sk_buff *buff; struct rd_msg *msg; struct in6_addr saddr_buf; struct rt6_info *rt; struct dst_entry *dst; struct flowi6 fl6; int rd_len; u8 ha_buf[MAX_ADDR_LEN], *ha = NULL, ops_data_buf[NDISC_OPS_REDIRECT_DATA_SPACE], *ops_data = NULL; bool ret; if (netif_is_l3_master(dev)) { dev = dev_get_by_index_rcu(net, IPCB(skb)->iif); if (!dev) return; } if (ipv6_get_lladdr(dev, &saddr_buf, IFA_F_TENTATIVE)) { net_dbg_ratelimited("Redirect: no link-local address on %s\n", dev->name); return; } if (!ipv6_addr_equal(&ipv6_hdr(skb)->daddr, target) && ipv6_addr_type(target) != (IPV6_ADDR_UNICAST|IPV6_ADDR_LINKLOCAL)) { net_dbg_ratelimited("Redirect: target address is not link-local unicast\n"); return; } icmpv6_flow_init(sk, &fl6, NDISC_REDIRECT, &saddr_buf, &ipv6_hdr(skb)->saddr, dev->ifindex); dst = ip6_route_output(net, NULL, &fl6); if (dst->error) { dst_release(dst); return; } dst = xfrm_lookup(net, dst, flowi6_to_flowi(&fl6), NULL, 0); if (IS_ERR(dst)) return; rt = dst_rt6_info(dst); if (rt->rt6i_flags & RTF_GATEWAY) { net_dbg_ratelimited("Redirect: destination is not a neighbour\n"); goto release; } peer = inet_getpeer_v6(net->ipv6.peers, &ipv6_hdr(skb)->saddr); ret = inet_peer_xrlim_allow(peer, 1*HZ); if (!ret) goto release; if (dev->addr_len) { struct neighbour *neigh = dst_neigh_lookup(skb_dst(skb), target); if (!neigh) { net_dbg_ratelimited("Redirect: no neigh for target address\n"); goto release; } read_lock_bh(&neigh->lock); if (neigh->nud_state & NUD_VALID) { memcpy(ha_buf, neigh->ha, dev->addr_len); read_unlock_bh(&neigh->lock); ha = ha_buf; optlen += ndisc_redirect_opt_addr_space(dev, neigh, ops_data_buf, &ops_data); } else read_unlock_bh(&neigh->lock); neigh_release(neigh); } rd_len = min_t(unsigned int, IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(*msg) - optlen, skb->len + 8); rd_len &= ~0x7; optlen += rd_len; buff = ndisc_alloc_skb(dev, sizeof(*msg) + optlen); if (!buff) goto release; msg = skb_put(buff, sizeof(*msg)); *msg = (struct rd_msg) { .icmph = { .icmp6_type = NDISC_REDIRECT, }, .target = *target, .dest = ipv6_hdr(skb)->daddr, }; /* * include target_address option */ if (ha) ndisc_fill_redirect_addr_option(buff, ha, ops_data); /* * build redirect option and copy skb over to the new packet. */ if (rd_len) ndisc_fill_redirect_hdr_option(buff, skb, rd_len); skb_dst_set(buff, dst); ndisc_send_skb(buff, &ipv6_hdr(skb)->saddr, &saddr_buf); return; release: dst_release(dst); } static void pndisc_redo(struct sk_buff *skb) { enum skb_drop_reason reason = ndisc_recv_ns(skb); kfree_skb_reason(skb, reason); } static int ndisc_is_multicast(const void *pkey) { return ipv6_addr_is_multicast((struct in6_addr *)pkey); } static bool ndisc_suppress_frag_ndisc(struct sk_buff *skb) { struct inet6_dev *idev = __in6_dev_get(skb->dev); if (!idev) return true; if (IP6CB(skb)->flags & IP6SKB_FRAGMENTED && READ_ONCE(idev->cnf.suppress_frag_ndisc)) { net_warn_ratelimited("Received fragmented ndisc packet. Carefully consider disabling suppress_frag_ndisc.\n"); return true; } return false; } enum skb_drop_reason ndisc_rcv(struct sk_buff *skb) { struct nd_msg *msg; SKB_DR(reason); if (ndisc_suppress_frag_ndisc(skb)) return SKB_DROP_REASON_IPV6_NDISC_FRAG; if (skb_linearize(skb)) return SKB_DROP_REASON_NOMEM; msg = (struct nd_msg *)skb_transport_header(skb); __skb_push(skb, skb->data - skb_transport_header(skb)); if (ipv6_hdr(skb)->hop_limit != 255) { net_dbg_ratelimited("NDISC: invalid hop-limit: %d\n", ipv6_hdr(skb)->hop_limit); return SKB_DROP_REASON_IPV6_NDISC_HOP_LIMIT; } if (msg->icmph.icmp6_code != 0) { net_dbg_ratelimited("NDISC: invalid ICMPv6 code: %d\n", msg->icmph.icmp6_code); return SKB_DROP_REASON_IPV6_NDISC_BAD_CODE; } switch (msg->icmph.icmp6_type) { case NDISC_NEIGHBOUR_SOLICITATION: memset(NEIGH_CB(skb), 0, sizeof(struct neighbour_cb)); reason = ndisc_recv_ns(skb); break; case NDISC_NEIGHBOUR_ADVERTISEMENT: reason = ndisc_recv_na(skb); break; case NDISC_ROUTER_SOLICITATION: reason = ndisc_recv_rs(skb); break; case NDISC_ROUTER_ADVERTISEMENT: reason = ndisc_router_discovery(skb); break; case NDISC_REDIRECT: reason = ndisc_redirect_rcv(skb); break; } return reason; } static int ndisc_netdev_event(struct notifier_block *this, unsigned long event, void *ptr) { struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct netdev_notifier_change_info *change_info; struct net *net = dev_net(dev); struct inet6_dev *idev; bool evict_nocarrier; switch (event) { case NETDEV_CHANGEADDR: neigh_changeaddr(&nd_tbl, dev); fib6_run_gc(0, net, false); fallthrough; case NETDEV_UP: idev = in6_dev_get(dev); if (!idev) break; if (READ_ONCE(idev->cnf.ndisc_notify) || READ_ONCE(net->ipv6.devconf_all->ndisc_notify)) ndisc_send_unsol_na(dev); in6_dev_put(idev); break; case NETDEV_CHANGE: idev = in6_dev_get(dev); if (!idev) evict_nocarrier = true; else { evict_nocarrier = READ_ONCE(idev->cnf.ndisc_evict_nocarrier) && READ_ONCE(net->ipv6.devconf_all->ndisc_evict_nocarrier); in6_dev_put(idev); } change_info = ptr; if (change_info->flags_changed & IFF_NOARP) neigh_changeaddr(&nd_tbl, dev); if (evict_nocarrier && !netif_carrier_ok(dev)) neigh_carrier_down(&nd_tbl, dev); break; case NETDEV_DOWN: neigh_ifdown(&nd_tbl, dev); fib6_run_gc(0, net, false); break; case NETDEV_NOTIFY_PEERS: ndisc_send_unsol_na(dev); break; default: break; } return NOTIFY_DONE; } static struct notifier_block ndisc_netdev_notifier = { .notifier_call = ndisc_netdev_event, .priority = ADDRCONF_NOTIFY_PRIORITY - 5, }; #ifdef CONFIG_SYSCTL static void ndisc_warn_deprecated_sysctl(const struct ctl_table *ctl, const char *func, const char *dev_name) { static char warncomm[TASK_COMM_LEN]; static int warned; if (strcmp(warncomm, current->comm) && warned < 5) { strscpy(warncomm, current->comm); pr_warn("process `%s' is using deprecated sysctl (%s) net.ipv6.neigh.%s.%s - use net.ipv6.neigh.%s.%s_ms instead\n", warncomm, func, dev_name, ctl->procname, dev_name, ctl->procname); warned++; } } int ndisc_ifinfo_sysctl_change(const struct ctl_table *ctl, int write, void *buffer, size_t *lenp, loff_t *ppos) { struct net_device *dev = ctl->extra1; struct inet6_dev *idev; int ret; if ((strcmp(ctl->procname, "retrans_time") == 0) || (strcmp(ctl->procname, "base_reachable_time") == 0)) ndisc_warn_deprecated_sysctl(ctl, "syscall", dev ? dev->name : "default"); if (strcmp(ctl->procname, "retrans_time") == 0) ret = neigh_proc_dointvec(ctl, write, buffer, lenp, ppos); else if (strcmp(ctl->procname, "base_reachable_time") == 0) ret = neigh_proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos); else if ((strcmp(ctl->procname, "retrans_time_ms") == 0) || (strcmp(ctl->procname, "base_reachable_time_ms") == 0)) ret = neigh_proc_dointvec_ms_jiffies(ctl, write, buffer, lenp, ppos); else ret = -1; if (write && ret == 0 && dev && (idev = in6_dev_get(dev)) != NULL) { if (ctl->data == &NEIGH_VAR(idev->nd_parms, BASE_REACHABLE_TIME)) idev->nd_parms->reachable_time = neigh_rand_reach_time(NEIGH_VAR(idev->nd_parms, BASE_REACHABLE_TIME)); WRITE_ONCE(idev->tstamp, jiffies); inet6_ifinfo_notify(RTM_NEWLINK, idev); in6_dev_put(idev); } return ret; } #endif static int __net_init ndisc_net_init(struct net *net) { struct ipv6_pinfo *np; struct sock *sk; int err; err = inet_ctl_sock_create(&sk, PF_INET6, SOCK_RAW, IPPROTO_ICMPV6, net); if (err < 0) { net_err_ratelimited("NDISC: Failed to initialize the control socket (err %d)\n", err); return err; } net->ipv6.ndisc_sk = sk; np = inet6_sk(sk); np->hop_limit = 255; /* Do not loopback ndisc messages */ inet6_clear_bit(MC6_LOOP, sk); return 0; } static void __net_exit ndisc_net_exit(struct net *net) { inet_ctl_sock_destroy(net->ipv6.ndisc_sk); } static struct pernet_operations ndisc_net_ops = { .init = ndisc_net_init, .exit = ndisc_net_exit, }; int __init ndisc_init(void) { int err; err = register_pernet_subsys(&ndisc_net_ops); if (err) return err; /* * Initialize the neighbour table */ neigh_table_init(NEIGH_ND_TABLE, &nd_tbl); #ifdef CONFIG_SYSCTL err = neigh_sysctl_register(NULL, &nd_tbl.parms, ndisc_ifinfo_sysctl_change); if (err) goto out_unregister_pernet; out: #endif return err; #ifdef CONFIG_SYSCTL out_unregister_pernet: unregister_pernet_subsys(&ndisc_net_ops); goto out; #endif } int __init ndisc_late_init(void) { return register_netdevice_notifier(&ndisc_netdev_notifier); } void ndisc_late_cleanup(void) { unregister_netdevice_notifier(&ndisc_netdev_notifier); } void ndisc_cleanup(void) { #ifdef CONFIG_SYSCTL neigh_sysctl_unregister(&nd_tbl.parms); #endif neigh_table_clear(NEIGH_ND_TABLE, &nd_tbl); unregister_pernet_subsys(&ndisc_net_ops); } |
| 9 14 21 3 37 1 36 31 4 33 33 24 24 24 24 12 15 1 5 1 3 12 4 8 7 3 10 2 2 11 1 1 10 1 4 8 37 1 1 2 33 30 3 33 12 12 3 3 9 3 3 1 1 1 1 1 1 4 3 1 3 1 4 4 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 | // SPDX-License-Identifier: GPL-2.0-or-later /* * net/sched/cls_route.c ROUTE4 classifier. * * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> */ #include <linux/module.h> #include <linux/slab.h> #include <linux/types.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/errno.h> #include <linux/skbuff.h> #include <net/dst.h> #include <net/route.h> #include <net/netlink.h> #include <net/act_api.h> #include <net/pkt_cls.h> #include <net/tc_wrapper.h> /* * 1. For now we assume that route tags < 256. * It allows to use direct table lookups, instead of hash tables. * 2. For now we assume that "from TAG" and "fromdev DEV" statements * are mutually exclusive. * 3. "to TAG from ANY" has higher priority, than "to ANY from XXX" */ struct route4_fastmap { struct route4_filter *filter; u32 id; int iif; }; struct route4_head { struct route4_fastmap fastmap[16]; struct route4_bucket __rcu *table[256 + 1]; struct rcu_head rcu; }; struct route4_bucket { /* 16 FROM buckets + 16 IIF buckets + 1 wildcard bucket */ struct route4_filter __rcu *ht[16 + 16 + 1]; struct rcu_head rcu; }; struct route4_filter { struct route4_filter __rcu *next; u32 id; int iif; struct tcf_result res; struct tcf_exts exts; u32 handle; struct route4_bucket *bkt; struct tcf_proto *tp; struct rcu_work rwork; }; #define ROUTE4_FAILURE ((struct route4_filter *)(-1L)) static inline int route4_fastmap_hash(u32 id, int iif) { return id & 0xF; } static DEFINE_SPINLOCK(fastmap_lock); static void route4_reset_fastmap(struct route4_head *head) { spin_lock_bh(&fastmap_lock); memset(head->fastmap, 0, sizeof(head->fastmap)); spin_unlock_bh(&fastmap_lock); } static void route4_set_fastmap(struct route4_head *head, u32 id, int iif, struct route4_filter *f) { int h = route4_fastmap_hash(id, iif); /* fastmap updates must look atomic to aling id, iff, filter */ spin_lock_bh(&fastmap_lock); head->fastmap[h].id = id; head->fastmap[h].iif = iif; head->fastmap[h].filter = f; spin_unlock_bh(&fastmap_lock); } static inline int route4_hash_to(u32 id) { return id & 0xFF; } static inline int route4_hash_from(u32 id) { return (id >> 16) & 0xF; } static inline int route4_hash_iif(int iif) { return 16 + ((iif >> 16) & 0xF); } static inline int route4_hash_wild(void) { return 32; } #define ROUTE4_APPLY_RESULT() \ { \ *res = f->res; \ if (tcf_exts_has_actions(&f->exts)) { \ int r = tcf_exts_exec(skb, &f->exts, res); \ if (r < 0) { \ dont_cache = 1; \ continue; \ } \ return r; \ } else if (!dont_cache) \ route4_set_fastmap(head, id, iif, f); \ return 0; \ } TC_INDIRECT_SCOPE int route4_classify(struct sk_buff *skb, const struct tcf_proto *tp, struct tcf_result *res) { struct route4_head *head = rcu_dereference_bh(tp->root); struct dst_entry *dst; struct route4_bucket *b; struct route4_filter *f; u32 id, h; int iif, dont_cache = 0; dst = skb_dst(skb); if (!dst) goto failure; id = dst->tclassid; iif = inet_iif(skb); h = route4_fastmap_hash(id, iif); spin_lock(&fastmap_lock); if (id == head->fastmap[h].id && iif == head->fastmap[h].iif && (f = head->fastmap[h].filter) != NULL) { if (f == ROUTE4_FAILURE) { spin_unlock(&fastmap_lock); goto failure; } *res = f->res; spin_unlock(&fastmap_lock); return 0; } spin_unlock(&fastmap_lock); h = route4_hash_to(id); restart: b = rcu_dereference_bh(head->table[h]); if (b) { for (f = rcu_dereference_bh(b->ht[route4_hash_from(id)]); f; f = rcu_dereference_bh(f->next)) if (f->id == id) ROUTE4_APPLY_RESULT(); for (f = rcu_dereference_bh(b->ht[route4_hash_iif(iif)]); f; f = rcu_dereference_bh(f->next)) if (f->iif == iif) ROUTE4_APPLY_RESULT(); for (f = rcu_dereference_bh(b->ht[route4_hash_wild()]); f; f = rcu_dereference_bh(f->next)) ROUTE4_APPLY_RESULT(); } if (h < 256) { h = 256; id &= ~0xFFFF; goto restart; } if (!dont_cache) route4_set_fastmap(head, id, iif, ROUTE4_FAILURE); failure: return -1; } static inline u32 to_hash(u32 id) { u32 h = id & 0xFF; if (id & 0x8000) h += 256; return h; } static inline u32 from_hash(u32 id) { id &= 0xFFFF; if (id == 0xFFFF) return 32; if (!(id & 0x8000)) { if (id > 255) return 256; return id & 0xF; } return 16 + (id & 0xF); } static void *route4_get(struct tcf_proto *tp, u32 handle) { struct route4_head *head = rtnl_dereference(tp->root); struct route4_bucket *b; struct route4_filter *f; unsigned int h1, h2; h1 = to_hash(handle); if (h1 > 256) return NULL; h2 = from_hash(handle >> 16); if (h2 > 32) return NULL; b = rtnl_dereference(head->table[h1]); if (b) { for (f = rtnl_dereference(b->ht[h2]); f; f = rtnl_dereference(f->next)) if (f->handle == handle) return f; } return NULL; } static int route4_init(struct tcf_proto *tp) { struct route4_head *head; head = kzalloc(sizeof(struct route4_head), GFP_KERNEL); if (head == NULL) return -ENOBUFS; rcu_assign_pointer(tp->root, head); return 0; } static void __route4_delete_filter(struct route4_filter *f) { tcf_exts_destroy(&f->exts); tcf_exts_put_net(&f->exts); kfree(f); } static void route4_delete_filter_work(struct work_struct *work) { struct route4_filter *f = container_of(to_rcu_work(work), struct route4_filter, rwork); rtnl_lock(); __route4_delete_filter(f); rtnl_unlock(); } static void route4_queue_work(struct route4_filter *f) { tcf_queue_work(&f->rwork, route4_delete_filter_work); } static void route4_destroy(struct tcf_proto *tp, bool rtnl_held, struct netlink_ext_ack *extack) { struct route4_head *head = rtnl_dereference(tp->root); int h1, h2; if (head == NULL) return; for (h1 = 0; h1 <= 256; h1++) { struct route4_bucket *b; b = rtnl_dereference(head->table[h1]); if (b) { for (h2 = 0; h2 <= 32; h2++) { struct route4_filter *f; while ((f = rtnl_dereference(b->ht[h2])) != NULL) { struct route4_filter *next; next = rtnl_dereference(f->next); RCU_INIT_POINTER(b->ht[h2], next); tcf_unbind_filter(tp, &f->res); if (tcf_exts_get_net(&f->exts)) route4_queue_work(f); else __route4_delete_filter(f); } } RCU_INIT_POINTER(head->table[h1], NULL); kfree_rcu(b, rcu); } } kfree_rcu(head, rcu); } static int route4_delete(struct tcf_proto *tp, void *arg, bool *last, bool rtnl_held, struct netlink_ext_ack *extack) { struct route4_head *head = rtnl_dereference(tp->root); struct route4_filter *f = arg; struct route4_filter __rcu **fp; struct route4_filter *nf; struct route4_bucket *b; unsigned int h = 0; int i, h1; if (!head || !f) return -EINVAL; h = f->handle; b = f->bkt; fp = &b->ht[from_hash(h >> 16)]; for (nf = rtnl_dereference(*fp); nf; fp = &nf->next, nf = rtnl_dereference(*fp)) { if (nf == f) { /* unlink it */ RCU_INIT_POINTER(*fp, rtnl_dereference(f->next)); /* Remove any fastmap lookups that might ref filter * notice we unlink'd the filter so we can't get it * back in the fastmap. */ route4_reset_fastmap(head); /* Delete it */ tcf_unbind_filter(tp, &f->res); tcf_exts_get_net(&f->exts); tcf_queue_work(&f->rwork, route4_delete_filter_work); /* Strip RTNL protected tree */ for (i = 0; i <= 32; i++) { struct route4_filter *rt; rt = rtnl_dereference(b->ht[i]); if (rt) goto out; } /* OK, session has no flows */ RCU_INIT_POINTER(head->table[to_hash(h)], NULL); kfree_rcu(b, rcu); break; } } out: *last = true; for (h1 = 0; h1 <= 256; h1++) { if (rcu_access_pointer(head->table[h1])) { *last = false; break; } } return 0; } static const struct nla_policy route4_policy[TCA_ROUTE4_MAX + 1] = { [TCA_ROUTE4_CLASSID] = { .type = NLA_U32 }, [TCA_ROUTE4_TO] = NLA_POLICY_MAX(NLA_U32, 0xFF), [TCA_ROUTE4_FROM] = NLA_POLICY_MAX(NLA_U32, 0xFF), [TCA_ROUTE4_IIF] = NLA_POLICY_MAX(NLA_U32, 0x7FFF), }; static int route4_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, struct route4_filter *f, u32 handle, struct route4_head *head, struct nlattr **tb, struct nlattr *est, int new, u32 flags, struct netlink_ext_ack *extack) { u32 id = 0, to = 0, nhandle = 0x8000; struct route4_filter *fp; unsigned int h1; struct route4_bucket *b; int err; err = tcf_exts_validate(net, tp, tb, est, &f->exts, flags, extack); if (err < 0) return err; if (tb[TCA_ROUTE4_TO]) { if (new && handle & 0x8000) { NL_SET_ERR_MSG(extack, "Invalid handle"); return -EINVAL; } to = nla_get_u32(tb[TCA_ROUTE4_TO]); nhandle = to; } if (tb[TCA_ROUTE4_FROM] && tb[TCA_ROUTE4_IIF]) { NL_SET_ERR_MSG_ATTR(extack, tb[TCA_ROUTE4_FROM], "'from' and 'fromif' are mutually exclusive"); return -EINVAL; } if (tb[TCA_ROUTE4_FROM]) { id = nla_get_u32(tb[TCA_ROUTE4_FROM]); nhandle |= id << 16; } else if (tb[TCA_ROUTE4_IIF]) { id = nla_get_u32(tb[TCA_ROUTE4_IIF]); nhandle |= (id | 0x8000) << 16; } else nhandle |= 0xFFFF << 16; if (handle && new) { nhandle |= handle & 0x7F00; if (nhandle != handle) { NL_SET_ERR_MSG_FMT(extack, "Handle mismatch constructed: %x (expected: %x)", handle, nhandle); return -EINVAL; } } if (!nhandle) { NL_SET_ERR_MSG(extack, "Replacing with handle of 0 is invalid"); return -EINVAL; } h1 = to_hash(nhandle); b = rtnl_dereference(head->table[h1]); if (!b) { b = kzalloc(sizeof(struct route4_bucket), GFP_KERNEL); if (b == NULL) return -ENOBUFS; rcu_assign_pointer(head->table[h1], b); } else { unsigned int h2 = from_hash(nhandle >> 16); for (fp = rtnl_dereference(b->ht[h2]); fp; fp = rtnl_dereference(fp->next)) if (fp->handle == f->handle) return -EEXIST; } if (tb[TCA_ROUTE4_TO]) f->id = to; if (tb[TCA_ROUTE4_FROM]) f->id = to | id<<16; else if (tb[TCA_ROUTE4_IIF]) f->iif = id; f->handle = nhandle; f->bkt = b; f->tp = tp; if (tb[TCA_ROUTE4_CLASSID]) { f->res.classid = nla_get_u32(tb[TCA_ROUTE4_CLASSID]); tcf_bind_filter(tp, &f->res, base); } return 0; } static int route4_change(struct net *net, struct sk_buff *in_skb, struct tcf_proto *tp, unsigned long base, u32 handle, struct nlattr **tca, void **arg, u32 flags, struct netlink_ext_ack *extack) { struct route4_head *head = rtnl_dereference(tp->root); struct route4_filter __rcu **fp; struct route4_filter *fold, *f1, *pfp, *f = NULL; struct route4_bucket *b; struct nlattr *tb[TCA_ROUTE4_MAX + 1]; unsigned int h, th; int err; bool new = true; if (!handle) { NL_SET_ERR_MSG(extack, "Creating with handle of 0 is invalid"); return -EINVAL; } if (NL_REQ_ATTR_CHECK(extack, NULL, tca, TCA_OPTIONS)) { NL_SET_ERR_MSG_MOD(extack, "Missing options"); return -EINVAL; } err = nla_parse_nested_deprecated(tb, TCA_ROUTE4_MAX, tca[TCA_OPTIONS], route4_policy, NULL); if (err < 0) return err; fold = *arg; if (fold && fold->handle != handle) return -EINVAL; err = -ENOBUFS; f = kzalloc(sizeof(struct route4_filter), GFP_KERNEL); if (!f) goto errout; err = tcf_exts_init(&f->exts, net, TCA_ROUTE4_ACT, TCA_ROUTE4_POLICE); if (err < 0) goto errout; if (fold) { f->id = fold->id; f->iif = fold->iif; f->handle = fold->handle; f->tp = fold->tp; f->bkt = fold->bkt; new = false; } err = route4_set_parms(net, tp, base, f, handle, head, tb, tca[TCA_RATE], new, flags, extack); if (err < 0) goto errout; h = from_hash(f->handle >> 16); fp = &f->bkt->ht[h]; for (pfp = rtnl_dereference(*fp); (f1 = rtnl_dereference(*fp)) != NULL; fp = &f1->next) if (f->handle < f1->handle) break; tcf_block_netif_keep_dst(tp->chain->block); rcu_assign_pointer(f->next, f1); rcu_assign_pointer(*fp, f); if (fold) { th = to_hash(fold->handle); h = from_hash(fold->handle >> 16); b = rtnl_dereference(head->table[th]); if (b) { fp = &b->ht[h]; for (pfp = rtnl_dereference(*fp); pfp; fp = &pfp->next, pfp = rtnl_dereference(*fp)) { if (pfp == fold) { rcu_assign_pointer(*fp, fold->next); break; } } } } route4_reset_fastmap(head); *arg = f; if (fold) { tcf_unbind_filter(tp, &fold->res); tcf_exts_get_net(&fold->exts); tcf_queue_work(&fold->rwork, route4_delete_filter_work); } return 0; errout: if (f) tcf_exts_destroy(&f->exts); kfree(f); return err; } static void route4_walk(struct tcf_proto *tp, struct tcf_walker *arg, bool rtnl_held) { struct route4_head *head = rtnl_dereference(tp->root); unsigned int h, h1; if (head == NULL || arg->stop) return; for (h = 0; h <= 256; h++) { struct route4_bucket *b = rtnl_dereference(head->table[h]); if (b) { for (h1 = 0; h1 <= 32; h1++) { struct route4_filter *f; for (f = rtnl_dereference(b->ht[h1]); f; f = rtnl_dereference(f->next)) { if (!tc_cls_stats_dump(tp, arg, f)) return; } } } } } static int route4_dump(struct net *net, struct tcf_proto *tp, void *fh, struct sk_buff *skb, struct tcmsg *t, bool rtnl_held) { struct route4_filter *f = fh; struct nlattr *nest; u32 id; if (f == NULL) return skb->len; t->tcm_handle = f->handle; nest = nla_nest_start_noflag(skb, TCA_OPTIONS); if (nest == NULL) goto nla_put_failure; if (!(f->handle & 0x8000)) { id = f->id & 0xFF; if (nla_put_u32(skb, TCA_ROUTE4_TO, id)) goto nla_put_failure; } if (f->handle & 0x80000000) { if ((f->handle >> 16) != 0xFFFF && nla_put_u32(skb, TCA_ROUTE4_IIF, f->iif)) goto nla_put_failure; } else { id = f->id >> 16; if (nla_put_u32(skb, TCA_ROUTE4_FROM, id)) goto nla_put_failure; } if (f->res.classid && nla_put_u32(skb, TCA_ROUTE4_CLASSID, f->res.classid)) goto nla_put_failure; if (tcf_exts_dump(skb, &f->exts) < 0) goto nla_put_failure; nla_nest_end(skb, nest); if (tcf_exts_dump_stats(skb, &f->exts) < 0) goto nla_put_failure; return skb->len; nla_put_failure: nla_nest_cancel(skb, nest); return -1; } static void route4_bind_class(void *fh, u32 classid, unsigned long cl, void *q, unsigned long base) { struct route4_filter *f = fh; tc_cls_bind_class(classid, cl, q, &f->res, base); } static struct tcf_proto_ops cls_route4_ops __read_mostly = { .kind = "route", .classify = route4_classify, .init = route4_init, .destroy = route4_destroy, .get = route4_get, .change = route4_change, .delete = route4_delete, .walk = route4_walk, .dump = route4_dump, .bind_class = route4_bind_class, .owner = THIS_MODULE, }; MODULE_ALIAS_NET_CLS("route"); static int __init init_route4(void) { return register_tcf_proto_ops(&cls_route4_ops); } static void __exit exit_route4(void) { unregister_tcf_proto_ops(&cls_route4_ops); } module_init(init_route4) module_exit(exit_route4) MODULE_DESCRIPTION("Routing table realm based TC classifier"); MODULE_LICENSE("GPL"); |
| 17 11 20 10 20 20 823 29 833 833 833 439 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 | /* SPDX-License-Identifier: GPL-2.0 * * page_pool/helpers.h * Author: Jesper Dangaard Brouer <netoptimizer@brouer.com> * Copyright (C) 2016 Red Hat, Inc. */ /** * DOC: page_pool allocator * * The page_pool allocator is optimized for recycling page or page fragment used * by skb packet and xdp frame. * * Basic use involves replacing any alloc_pages() calls with page_pool_alloc(), * which allocate memory with or without page splitting depending on the * requested memory size. * * If the driver knows that it always requires full pages or its allocations are * always smaller than half a page, it can use one of the more specific API * calls: * * 1. page_pool_alloc_pages(): allocate memory without page splitting when * driver knows that the memory it need is always bigger than half of the page * allocated from page pool. There is no cache line dirtying for 'struct page' * when a page is recycled back to the page pool. * * 2. page_pool_alloc_frag(): allocate memory with page splitting when driver * knows that the memory it need is always smaller than or equal to half of the * page allocated from page pool. Page splitting enables memory saving and thus * avoids TLB/cache miss for data access, but there also is some cost to * implement page splitting, mainly some cache line dirtying/bouncing for * 'struct page' and atomic operation for page->pp_ref_count. * * The API keeps track of in-flight pages, in order to let API users know when * it is safe to free a page_pool object, the API users must call * page_pool_put_page() or page_pool_free_va() to free the page_pool object, or * attach the page_pool object to a page_pool-aware object like skbs marked with * skb_mark_for_recycle(). * * page_pool_put_page() may be called multiple times on the same page if a page * is split into multiple fragments. For the last fragment, it will either * recycle the page, or in case of page->_refcount > 1, it will release the DMA * mapping and in-flight state accounting. * * dma_sync_single_range_for_device() is only called for the last fragment when * page_pool is created with PP_FLAG_DMA_SYNC_DEV flag, so it depends on the * last freed fragment to do the sync_for_device operation for all fragments in * the same page when a page is split. The API user must setup pool->p.max_len * and pool->p.offset correctly and ensure that page_pool_put_page() is called * with dma_sync_size being -1 for fragment API. */ #ifndef _NET_PAGE_POOL_HELPERS_H #define _NET_PAGE_POOL_HELPERS_H #include <linux/dma-mapping.h> #include <net/page_pool/types.h> #include <net/net_debug.h> #include <net/netmem.h> #ifdef CONFIG_PAGE_POOL_STATS /* Deprecated driver-facing API, use netlink instead */ int page_pool_ethtool_stats_get_count(void); u8 *page_pool_ethtool_stats_get_strings(u8 *data); u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats); bool page_pool_get_stats(const struct page_pool *pool, struct page_pool_stats *stats); #else static inline int page_pool_ethtool_stats_get_count(void) { return 0; } static inline u8 *page_pool_ethtool_stats_get_strings(u8 *data) { return data; } static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats) { return data; } #endif /** * page_pool_dev_alloc_pages() - allocate a page. * @pool: pool from which to allocate * * Get a page from the page allocator or page_pool caches. */ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool) { gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); return page_pool_alloc_pages(pool, gfp); } /** * page_pool_dev_alloc_frag() - allocate a page fragment. * @pool: pool from which to allocate * @offset: offset to the allocated page * @size: requested size * * Get a page fragment from the page allocator or page_pool caches. * * Return: allocated page fragment, otherwise return NULL. */ static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool, unsigned int *offset, unsigned int size) { gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); return page_pool_alloc_frag(pool, offset, size, gfp); } static inline netmem_ref page_pool_alloc_netmem(struct page_pool *pool, unsigned int *offset, unsigned int *size, gfp_t gfp) { unsigned int max_size = PAGE_SIZE << pool->p.order; netmem_ref netmem; if ((*size << 1) > max_size) { *size = max_size; *offset = 0; return page_pool_alloc_netmems(pool, gfp); } netmem = page_pool_alloc_frag_netmem(pool, offset, *size, gfp); if (unlikely(!netmem)) return 0; /* There is very likely not enough space for another fragment, so append * the remaining size to the current fragment to avoid truesize * underestimate problem. */ if (pool->frag_offset + *size > max_size) { *size = max_size - *offset; pool->frag_offset = max_size; } return netmem; } static inline netmem_ref page_pool_dev_alloc_netmem(struct page_pool *pool, unsigned int *offset, unsigned int *size) { gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN; return page_pool_alloc_netmem(pool, offset, size, gfp); } static inline netmem_ref page_pool_dev_alloc_netmems(struct page_pool *pool) { gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN; return page_pool_alloc_netmems(pool, gfp); } static inline struct page *page_pool_alloc(struct page_pool *pool, unsigned int *offset, unsigned int *size, gfp_t gfp) { return netmem_to_page(page_pool_alloc_netmem(pool, offset, size, gfp)); } /** * page_pool_dev_alloc() - allocate a page or a page fragment. * @pool: pool from which to allocate * @offset: offset to the allocated page * @size: in as the requested size, out as the allocated size * * Get a page or a page fragment from the page allocator or page_pool caches * depending on the requested size in order to allocate memory with least memory * utilization and performance penalty. * * Return: allocated page or page fragment, otherwise return NULL. */ static inline struct page *page_pool_dev_alloc(struct page_pool *pool, unsigned int *offset, unsigned int *size) { gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); return page_pool_alloc(pool, offset, size, gfp); } static inline void *page_pool_alloc_va(struct page_pool *pool, unsigned int *size, gfp_t gfp) { unsigned int offset; struct page *page; /* Mask off __GFP_HIGHMEM to ensure we can use page_address() */ page = page_pool_alloc(pool, &offset, size, gfp & ~__GFP_HIGHMEM); if (unlikely(!page)) return NULL; return page_address(page) + offset; } /** * page_pool_dev_alloc_va() - allocate a page or a page fragment and return its * va. * @pool: pool from which to allocate * @size: in as the requested size, out as the allocated size * * This is just a thin wrapper around the page_pool_alloc() API, and * it returns va of the allocated page or page fragment. * * Return: the va for the allocated page or page fragment, otherwise return NULL. */ static inline void *page_pool_dev_alloc_va(struct page_pool *pool, unsigned int *size) { gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); return page_pool_alloc_va(pool, size, gfp); } /** * page_pool_get_dma_dir() - Retrieve the stored DMA direction. * @pool: pool from which page was allocated * * Get the stored dma direction. A driver might decide to store this locally * and avoid the extra cache line from page_pool to determine the direction. */ static inline enum dma_data_direction page_pool_get_dma_dir(const struct page_pool *pool) { return pool->p.dma_dir; } static inline void page_pool_fragment_netmem(netmem_ref netmem, long nr) { atomic_long_set(netmem_get_pp_ref_count_ref(netmem), nr); } /** * page_pool_fragment_page() - split a fresh page into fragments * @page: page to split * @nr: references to set * * pp_ref_count represents the number of outstanding references to the page, * which will be freed using page_pool APIs (rather than page allocator APIs * like put_page()). Such references are usually held by page_pool-aware * objects like skbs marked for page pool recycling. * * This helper allows the caller to take (set) multiple references to a * freshly allocated page. The page must be freshly allocated (have a * pp_ref_count of 1). This is commonly done by drivers and * "fragment allocators" to save atomic operations - either when they know * upfront how many references they will need; or to take MAX references and * return the unused ones with a single atomic dec(), instead of performing * multiple atomic inc() operations. */ static inline void page_pool_fragment_page(struct page *page, long nr) { page_pool_fragment_netmem(page_to_netmem(page), nr); } static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) { atomic_long_t *pp_ref_count = netmem_get_pp_ref_count_ref(netmem); long ret; /* If nr == pp_ref_count then we have cleared all remaining * references to the page: * 1. 'n == 1': no need to actually overwrite it. * 2. 'n != 1': overwrite it with one, which is the rare case * for pp_ref_count draining. * * The main advantage to doing this is that not only we avoid a atomic * update, as an atomic_read is generally a much cheaper operation than * an atomic update, especially when dealing with a page that may be * referenced by only 2 or 3 users; but also unify the pp_ref_count * handling by ensuring all pages have partitioned into only 1 piece * initially, and only overwrite it when the page is partitioned into * more than one piece. */ if (atomic_long_read(pp_ref_count) == nr) { /* As we have ensured nr is always one for constant case using * the BUILD_BUG_ON(), only need to handle the non-constant case * here for pp_ref_count draining, which is a rare case. */ BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1); if (!__builtin_constant_p(nr)) atomic_long_set(pp_ref_count, 1); return 0; } ret = atomic_long_sub_return(nr, pp_ref_count); WARN_ON(ret < 0); /* We are the last user here too, reset pp_ref_count back to 1 to * ensure all pages have been partitioned into 1 piece initially, * this should be the rare case when the last two fragment users call * page_pool_unref_page() currently. */ if (unlikely(!ret)) atomic_long_set(pp_ref_count, 1); return ret; } static inline long page_pool_unref_page(struct page *page, long nr) { return page_pool_unref_netmem(page_to_netmem(page), nr); } static inline void page_pool_ref_netmem(netmem_ref netmem) { atomic_long_inc(netmem_get_pp_ref_count_ref(netmem)); } static inline void page_pool_ref_page(struct page *page) { page_pool_ref_netmem(page_to_netmem(page)); } static inline bool page_pool_unref_and_test(netmem_ref netmem) { /* If page_pool_unref_page() returns 0, we were the last user */ return page_pool_unref_netmem(netmem, 1) == 0; } static inline void page_pool_put_netmem(struct page_pool *pool, netmem_ref netmem, unsigned int dma_sync_size, bool allow_direct) { /* When page_pool isn't compiled-in, net/core/xdp.c doesn't * allow registering MEM_TYPE_PAGE_POOL, but shield linker. */ #ifdef CONFIG_PAGE_POOL if (!page_pool_unref_and_test(netmem)) return; page_pool_put_unrefed_netmem(pool, netmem, dma_sync_size, allow_direct); #endif } /** * page_pool_put_page() - release a reference to a page pool page * @pool: pool from which page was allocated * @page: page to release a reference on * @dma_sync_size: how much of the page may have been touched by the device * @allow_direct: released by the consumer, allow lockless caching * * The outcome of this depends on the page refcnt. If the driver bumps * the refcnt > 1 this will unmap the page. If the page refcnt is 1 * the allocator owns the page and will try to recycle it in one of the pool * caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device * using dma_sync_single_range_for_device(). */ static inline void page_pool_put_page(struct page_pool *pool, struct page *page, unsigned int dma_sync_size, bool allow_direct) { page_pool_put_netmem(pool, page_to_netmem(page), dma_sync_size, allow_direct); } static inline void page_pool_put_full_netmem(struct page_pool *pool, netmem_ref netmem, bool allow_direct) { page_pool_put_netmem(pool, netmem, -1, allow_direct); } /** * page_pool_put_full_page() - release a reference on a page pool page * @pool: pool from which page was allocated * @page: page to release a reference on * @allow_direct: released by the consumer, allow lockless caching * * Similar to page_pool_put_page(), but will DMA sync the entire memory area * as configured in &page_pool_params.max_len. */ static inline void page_pool_put_full_page(struct page_pool *pool, struct page *page, bool allow_direct) { page_pool_put_netmem(pool, page_to_netmem(page), -1, allow_direct); } /** * page_pool_recycle_direct() - release a reference on a page pool page * @pool: pool from which page was allocated * @page: page to release a reference on * * Similar to page_pool_put_full_page() but caller must guarantee safe context * (e.g NAPI), since it will recycle the page directly into the pool fast cache. */ static inline void page_pool_recycle_direct(struct page_pool *pool, struct page *page) { page_pool_put_full_page(pool, page, true); } static inline void page_pool_recycle_direct_netmem(struct page_pool *pool, netmem_ref netmem) { page_pool_put_full_netmem(pool, netmem, true); } #define PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA \ (sizeof(dma_addr_t) > sizeof(unsigned long)) /** * page_pool_free_va() - free a va into the page_pool * @pool: pool from which va was allocated * @va: va to be freed * @allow_direct: freed by the consumer, allow lockless caching * * Free a va allocated from page_pool_allo_va(). */ static inline void page_pool_free_va(struct page_pool *pool, void *va, bool allow_direct) { page_pool_put_page(pool, virt_to_head_page(va), -1, allow_direct); } static inline dma_addr_t page_pool_get_dma_addr_netmem(netmem_ref netmem) { dma_addr_t ret = netmem_get_dma_addr(netmem); if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) ret <<= PAGE_SHIFT; return ret; } /** * page_pool_get_dma_addr() - Retrieve the stored DMA address. * @page: page allocated from a page pool * * Fetch the DMA address of the page. The page pool to which the page belongs * must had been created with PP_FLAG_DMA_MAP. */ static inline dma_addr_t page_pool_get_dma_addr(const struct page *page) { return page_pool_get_dma_addr_netmem(page_to_netmem(page)); } static inline void __page_pool_dma_sync_for_cpu(const struct page_pool *pool, const dma_addr_t dma_addr, u32 offset, u32 dma_sync_size) { dma_sync_single_range_for_cpu(pool->p.dev, dma_addr, offset + pool->p.offset, dma_sync_size, page_pool_get_dma_dir(pool)); } /** * page_pool_dma_sync_for_cpu - sync Rx page for CPU after it's written by HW * @pool: &page_pool the @page belongs to * @page: page to sync * @offset: offset from page start to "hard" start if using PP frags * @dma_sync_size: size of the data written to the page * * Can be used as a shorthand to sync Rx pages before accessing them in the * driver. Caller must ensure the pool was created with ``PP_FLAG_DMA_MAP``. * Note that this version performs DMA sync unconditionally, even if the * associated PP doesn't perform sync-for-device. */ static inline void page_pool_dma_sync_for_cpu(const struct page_pool *pool, const struct page *page, u32 offset, u32 dma_sync_size) { __page_pool_dma_sync_for_cpu(pool, page_pool_get_dma_addr(page), offset, dma_sync_size); } static inline void page_pool_dma_sync_netmem_for_cpu(const struct page_pool *pool, const netmem_ref netmem, u32 offset, u32 dma_sync_size) { if (!pool->dma_sync_for_cpu) return; __page_pool_dma_sync_for_cpu(pool, page_pool_get_dma_addr_netmem(netmem), offset, dma_sync_size); } static inline bool page_pool_put(struct page_pool *pool) { return refcount_dec_and_test(&pool->user_cnt); } static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) { if (unlikely(pool->p.nid != new_nid)) page_pool_update_nid(pool, new_nid); } static inline bool page_pool_is_unreadable(struct page_pool *pool) { return !!pool->mp_ops; } #endif /* _NET_PAGE_POOL_HELPERS_H */ |
| 1 1 3 2 1 1 3 1 2 133 133 133 131 131 77 77 2 2 2 3 1 3 2 2 2 1 1 36 1 5 35 3 1 1 1 7 4 2 2 2 4 2 4 1 1 5 3 1 1 1 5 1 2 1 1 3 1 1 1 27 2 24 1 2 1 22 3 1 1 1 1 3 1 1 1 1 11 4 5 2 5 3 5 4 2 3 3 3 3 4 2 1 1 5 2 1 2 9 2 7 3 2 1 1 1 1 1 1 1 2 8 1 2 1 1 1 2 5 1 2 1 1 706 314 405 131 74 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2011 Instituto Nokia de Tecnologia * * Authors: * Lauro Ramos Venancio <lauro.venancio@openbossa.org> * Aloisio Almeida Jr <aloisio.almeida@openbossa.org> * * Vendor commands implementation based on net/wireless/nl80211.c * which is: * * Copyright 2006-2010 Johannes Berg <johannes@sipsolutions.net> * Copyright 2013-2014 Intel Mobile Communications GmbH */ #define pr_fmt(fmt) KBUILD_MODNAME ": %s: " fmt, __func__ #include <net/genetlink.h> #include <linux/nfc.h> #include <linux/slab.h> #include "nfc.h" #include "llcp.h" static const struct genl_multicast_group nfc_genl_mcgrps[] = { { .name = NFC_GENL_MCAST_EVENT_NAME, }, }; static struct genl_family nfc_genl_family; static const struct nla_policy nfc_genl_policy[NFC_ATTR_MAX + 1] = { [NFC_ATTR_DEVICE_INDEX] = { .type = NLA_U32 }, [NFC_ATTR_DEVICE_NAME] = { .type = NLA_STRING, .len = NFC_DEVICE_NAME_MAXSIZE }, [NFC_ATTR_PROTOCOLS] = { .type = NLA_U32 }, [NFC_ATTR_TARGET_INDEX] = { .type = NLA_U32 }, [NFC_ATTR_COMM_MODE] = { .type = NLA_U8 }, [NFC_ATTR_RF_MODE] = { .type = NLA_U8 }, [NFC_ATTR_DEVICE_POWERED] = { .type = NLA_U8 }, [NFC_ATTR_IM_PROTOCOLS] = { .type = NLA_U32 }, [NFC_ATTR_TM_PROTOCOLS] = { .type = NLA_U32 }, [NFC_ATTR_LLC_PARAM_LTO] = { .type = NLA_U8 }, [NFC_ATTR_LLC_PARAM_RW] = { .type = NLA_U8 }, [NFC_ATTR_LLC_PARAM_MIUX] = { .type = NLA_U16 }, [NFC_ATTR_LLC_SDP] = { .type = NLA_NESTED }, [NFC_ATTR_FIRMWARE_NAME] = { .type = NLA_STRING, .len = NFC_FIRMWARE_NAME_MAXSIZE }, [NFC_ATTR_SE_INDEX] = { .type = NLA_U32 }, [NFC_ATTR_SE_APDU] = { .type = NLA_BINARY }, [NFC_ATTR_VENDOR_ID] = { .type = NLA_U32 }, [NFC_ATTR_VENDOR_SUBCMD] = { .type = NLA_U32 }, [NFC_ATTR_VENDOR_DATA] = { .type = NLA_BINARY }, }; static const struct nla_policy nfc_sdp_genl_policy[NFC_SDP_ATTR_MAX + 1] = { [NFC_SDP_ATTR_URI] = { .type = NLA_STRING, .len = U8_MAX - 4 }, [NFC_SDP_ATTR_SAP] = { .type = NLA_U8 }, }; static int nfc_genl_send_target(struct sk_buff *msg, struct nfc_target *target, struct netlink_callback *cb, int flags) { void *hdr; hdr = genlmsg_put(msg, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, &nfc_genl_family, flags, NFC_CMD_GET_TARGET); if (!hdr) return -EMSGSIZE; genl_dump_check_consistent(cb, hdr); if (nla_put_u32(msg, NFC_ATTR_TARGET_INDEX, target->idx) || nla_put_u32(msg, NFC_ATTR_PROTOCOLS, target->supported_protocols) || nla_put_u16(msg, NFC_ATTR_TARGET_SENS_RES, target->sens_res) || nla_put_u8(msg, NFC_ATTR_TARGET_SEL_RES, target->sel_res)) goto nla_put_failure; if (target->nfcid1_len > 0 && nla_put(msg, NFC_ATTR_TARGET_NFCID1, target->nfcid1_len, target->nfcid1)) goto nla_put_failure; if (target->sensb_res_len > 0 && nla_put(msg, NFC_ATTR_TARGET_SENSB_RES, target->sensb_res_len, target->sensb_res)) goto nla_put_failure; if (target->sensf_res_len > 0 && nla_put(msg, NFC_ATTR_TARGET_SENSF_RES, target->sensf_res_len, target->sensf_res)) goto nla_put_failure; if (target->is_iso15693) { if (nla_put_u8(msg, NFC_ATTR_TARGET_ISO15693_DSFID, target->iso15693_dsfid) || nla_put(msg, NFC_ATTR_TARGET_ISO15693_UID, sizeof(target->iso15693_uid), target->iso15693_uid)) goto nla_put_failure; } if (target->ats_len > 0 && nla_put(msg, NFC_ATTR_TARGET_ATS, target->ats_len, target->ats)) goto nla_put_failure; genlmsg_end(msg, hdr); return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } static struct nfc_dev *__get_device_from_cb(struct netlink_callback *cb) { const struct genl_dumpit_info *info = genl_dumpit_info(cb); struct nfc_dev *dev; u32 idx; if (!info->info.attrs[NFC_ATTR_DEVICE_INDEX]) return ERR_PTR(-EINVAL); idx = nla_get_u32(info->info.attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return ERR_PTR(-ENODEV); return dev; } static int nfc_genl_dump_targets(struct sk_buff *skb, struct netlink_callback *cb) { int i = cb->args[0]; struct nfc_dev *dev = (struct nfc_dev *) cb->args[1]; int rc; if (!dev) { dev = __get_device_from_cb(cb); if (IS_ERR(dev)) return PTR_ERR(dev); cb->args[1] = (long) dev; } device_lock(&dev->dev); cb->seq = dev->targets_generation; while (i < dev->n_targets) { rc = nfc_genl_send_target(skb, &dev->targets[i], cb, NLM_F_MULTI); if (rc < 0) break; i++; } device_unlock(&dev->dev); cb->args[0] = i; return skb->len; } static int nfc_genl_dump_targets_done(struct netlink_callback *cb) { struct nfc_dev *dev = (struct nfc_dev *) cb->args[1]; if (dev) nfc_put_device(dev); return 0; } int nfc_genl_targets_found(struct nfc_dev *dev) { struct sk_buff *msg; void *hdr; dev->genl_data.poll_req_portid = 0; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_TARGETS_FOUND); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; genlmsg_end(msg, hdr); return genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_ATOMIC); nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_target_lost(struct nfc_dev *dev, u32 target_idx) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_TARGET_LOST); if (!hdr) goto free_msg; if (nla_put_string(msg, NFC_ATTR_DEVICE_NAME, nfc_device_name(dev)) || nla_put_u32(msg, NFC_ATTR_TARGET_INDEX, target_idx)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_tm_activated(struct nfc_dev *dev, u32 protocol) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_TM_ACTIVATED); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; if (nla_put_u32(msg, NFC_ATTR_TM_PROTOCOLS, protocol)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_tm_deactivated(struct nfc_dev *dev) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_TM_DEACTIVATED); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } static int nfc_genl_setup_device_added(struct nfc_dev *dev, struct sk_buff *msg) { if (nla_put_string(msg, NFC_ATTR_DEVICE_NAME, nfc_device_name(dev)) || nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx) || nla_put_u32(msg, NFC_ATTR_PROTOCOLS, dev->supported_protocols) || nla_put_u8(msg, NFC_ATTR_DEVICE_POWERED, dev->dev_up) || nla_put_u8(msg, NFC_ATTR_RF_MODE, dev->rf_mode)) return -1; return 0; } int nfc_genl_device_added(struct nfc_dev *dev) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_DEVICE_ADDED); if (!hdr) goto free_msg; if (nfc_genl_setup_device_added(dev, msg)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_device_removed(struct nfc_dev *dev) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_DEVICE_REMOVED); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_llc_send_sdres(struct nfc_dev *dev, struct hlist_head *sdres_list) { struct sk_buff *msg; struct nlattr *sdp_attr, *uri_attr; struct nfc_llcp_sdp_tlv *sdres; struct hlist_node *n; void *hdr; int rc = -EMSGSIZE; int i; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_LLC_SDRES); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; sdp_attr = nla_nest_start_noflag(msg, NFC_ATTR_LLC_SDP); if (sdp_attr == NULL) { rc = -ENOMEM; goto nla_put_failure; } i = 1; hlist_for_each_entry_safe(sdres, n, sdres_list, node) { pr_debug("uri: %s, sap: %d\n", sdres->uri, sdres->sap); uri_attr = nla_nest_start_noflag(msg, i++); if (uri_attr == NULL) { rc = -ENOMEM; goto nla_put_failure; } if (nla_put_u8(msg, NFC_SDP_ATTR_SAP, sdres->sap)) goto nla_put_failure; if (nla_put_string(msg, NFC_SDP_ATTR_URI, sdres->uri)) goto nla_put_failure; nla_nest_end(msg, uri_attr); hlist_del(&sdres->node); nfc_llcp_free_sdp_tlv(sdres); } nla_nest_end(msg, sdp_attr); genlmsg_end(msg, hdr); return genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_ATOMIC); nla_put_failure: free_msg: nlmsg_free(msg); nfc_llcp_free_sdp_tlv_list(sdres_list); return rc; } int nfc_genl_se_added(struct nfc_dev *dev, u32 se_idx, u16 type) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_SE_ADDED); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx) || nla_put_u32(msg, NFC_ATTR_SE_INDEX, se_idx) || nla_put_u8(msg, NFC_ATTR_SE_TYPE, type)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_se_removed(struct nfc_dev *dev, u32 se_idx) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_SE_REMOVED); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx) || nla_put_u32(msg, NFC_ATTR_SE_INDEX, se_idx)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_se_transaction(struct nfc_dev *dev, u8 se_idx, struct nfc_evt_transaction *evt_transaction) { struct nfc_se *se; struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_SE_TRANSACTION); if (!hdr) goto free_msg; se = nfc_find_se(dev, se_idx); if (!se) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx) || nla_put_u32(msg, NFC_ATTR_SE_INDEX, se_idx) || nla_put_u8(msg, NFC_ATTR_SE_TYPE, se->type) || nla_put(msg, NFC_ATTR_SE_AID, evt_transaction->aid_len, evt_transaction->aid) || nla_put(msg, NFC_ATTR_SE_PARAMS, evt_transaction->params_len, evt_transaction->params)) goto nla_put_failure; /* evt_transaction is no more used */ devm_kfree(&dev->dev, evt_transaction); genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: /* evt_transaction is no more used */ devm_kfree(&dev->dev, evt_transaction); nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_se_connectivity(struct nfc_dev *dev, u8 se_idx) { const struct nfc_se *se; struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_EVENT_SE_CONNECTIVITY); if (!hdr) goto free_msg; se = nfc_find_se(dev, se_idx); if (!se) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx) || nla_put_u32(msg, NFC_ATTR_SE_INDEX, se_idx) || nla_put_u8(msg, NFC_ATTR_SE_TYPE, se->type)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } static int nfc_genl_send_device(struct sk_buff *msg, struct nfc_dev *dev, u32 portid, u32 seq, struct netlink_callback *cb, int flags) { void *hdr; hdr = genlmsg_put(msg, portid, seq, &nfc_genl_family, flags, NFC_CMD_GET_DEVICE); if (!hdr) return -EMSGSIZE; if (cb) genl_dump_check_consistent(cb, hdr); if (nfc_genl_setup_device_added(dev, msg)) goto nla_put_failure; genlmsg_end(msg, hdr); return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } static int nfc_genl_dump_devices(struct sk_buff *skb, struct netlink_callback *cb) { struct class_dev_iter *iter = (struct class_dev_iter *) cb->args[0]; struct nfc_dev *dev = (struct nfc_dev *) cb->args[1]; bool first_call = false; if (!iter) { first_call = true; iter = kmalloc(sizeof(struct class_dev_iter), GFP_KERNEL); if (!iter) return -ENOMEM; cb->args[0] = (long) iter; } mutex_lock(&nfc_devlist_mutex); cb->seq = nfc_devlist_generation; if (first_call) { nfc_device_iter_init(iter); dev = nfc_device_iter_next(iter); } while (dev) { int rc; rc = nfc_genl_send_device(skb, dev, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, cb, NLM_F_MULTI); if (rc < 0) break; dev = nfc_device_iter_next(iter); } mutex_unlock(&nfc_devlist_mutex); cb->args[1] = (long) dev; return skb->len; } static int nfc_genl_dump_devices_done(struct netlink_callback *cb) { struct class_dev_iter *iter = (struct class_dev_iter *) cb->args[0]; if (iter) { nfc_device_iter_exit(iter); kfree(iter); } return 0; } int nfc_genl_dep_link_up_event(struct nfc_dev *dev, u32 target_idx, u8 comm_mode, u8 rf_mode) { struct sk_buff *msg; void *hdr; pr_debug("DEP link is up\n"); msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_CMD_DEP_LINK_UP); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; if (rf_mode == NFC_RF_INITIATOR && nla_put_u32(msg, NFC_ATTR_TARGET_INDEX, target_idx)) goto nla_put_failure; if (nla_put_u8(msg, NFC_ATTR_COMM_MODE, comm_mode) || nla_put_u8(msg, NFC_ATTR_RF_MODE, rf_mode)) goto nla_put_failure; genlmsg_end(msg, hdr); dev->dep_link_up = true; genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_ATOMIC); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } int nfc_genl_dep_link_down_event(struct nfc_dev *dev) { struct sk_buff *msg; void *hdr; pr_debug("DEP link is down\n"); msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_CMD_DEP_LINK_DOWN); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_ATOMIC); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } static int nfc_genl_get_device(struct sk_buff *skb, struct genl_info *info) { struct sk_buff *msg; struct nfc_dev *dev; u32 idx; int rc = -ENOBUFS; if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) { rc = -ENOMEM; goto out_putdev; } rc = nfc_genl_send_device(msg, dev, info->snd_portid, info->snd_seq, NULL, 0); if (rc < 0) goto out_free; nfc_put_device(dev); return genlmsg_reply(msg, info); out_free: nlmsg_free(msg); out_putdev: nfc_put_device(dev); return rc; } static int nfc_genl_dev_up(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx; if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; rc = nfc_dev_up(dev); nfc_put_device(dev); return rc; } static int nfc_genl_dev_down(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx; if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; rc = nfc_dev_down(dev); nfc_put_device(dev); return rc; } static int nfc_genl_start_poll(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx; u32 im_protocols = 0, tm_protocols = 0; pr_debug("Poll start\n"); if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || ((!info->attrs[NFC_ATTR_IM_PROTOCOLS] && !info->attrs[NFC_ATTR_PROTOCOLS]) && !info->attrs[NFC_ATTR_TM_PROTOCOLS])) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); if (info->attrs[NFC_ATTR_TM_PROTOCOLS]) tm_protocols = nla_get_u32(info->attrs[NFC_ATTR_TM_PROTOCOLS]); if (info->attrs[NFC_ATTR_IM_PROTOCOLS]) im_protocols = nla_get_u32(info->attrs[NFC_ATTR_IM_PROTOCOLS]); else if (info->attrs[NFC_ATTR_PROTOCOLS]) im_protocols = nla_get_u32(info->attrs[NFC_ATTR_PROTOCOLS]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; mutex_lock(&dev->genl_data.genl_data_mutex); rc = nfc_start_poll(dev, im_protocols, tm_protocols); if (!rc) dev->genl_data.poll_req_portid = info->snd_portid; mutex_unlock(&dev->genl_data.genl_data_mutex); nfc_put_device(dev); return rc; } static int nfc_genl_stop_poll(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx; if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; device_lock(&dev->dev); if (!dev->polling) { device_unlock(&dev->dev); nfc_put_device(dev); return -EINVAL; } device_unlock(&dev->dev); mutex_lock(&dev->genl_data.genl_data_mutex); if (dev->genl_data.poll_req_portid != info->snd_portid) { rc = -EBUSY; goto out; } rc = nfc_stop_poll(dev); dev->genl_data.poll_req_portid = 0; out: mutex_unlock(&dev->genl_data.genl_data_mutex); nfc_put_device(dev); return rc; } static int nfc_genl_activate_target(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; u32 device_idx, target_idx, protocol; int rc; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_TARGET_INDEX] || !info->attrs[NFC_ATTR_PROTOCOLS]) return -EINVAL; device_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(device_idx); if (!dev) return -ENODEV; target_idx = nla_get_u32(info->attrs[NFC_ATTR_TARGET_INDEX]); protocol = nla_get_u32(info->attrs[NFC_ATTR_PROTOCOLS]); nfc_deactivate_target(dev, target_idx, NFC_TARGET_MODE_SLEEP); rc = nfc_activate_target(dev, target_idx, protocol); nfc_put_device(dev); return rc; } static int nfc_genl_deactivate_target(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; u32 device_idx, target_idx; int rc; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_TARGET_INDEX]) return -EINVAL; device_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(device_idx); if (!dev) return -ENODEV; target_idx = nla_get_u32(info->attrs[NFC_ATTR_TARGET_INDEX]); rc = nfc_deactivate_target(dev, target_idx, NFC_TARGET_MODE_SLEEP); nfc_put_device(dev); return rc; } static int nfc_genl_dep_link_up(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc, tgt_idx; u32 idx; u8 comm; pr_debug("DEP link up\n"); if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_COMM_MODE]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); if (!info->attrs[NFC_ATTR_TARGET_INDEX]) tgt_idx = NFC_TARGET_IDX_ANY; else tgt_idx = nla_get_u32(info->attrs[NFC_ATTR_TARGET_INDEX]); comm = nla_get_u8(info->attrs[NFC_ATTR_COMM_MODE]); if (comm != NFC_COMM_ACTIVE && comm != NFC_COMM_PASSIVE) return -EINVAL; dev = nfc_get_device(idx); if (!dev) return -ENODEV; rc = nfc_dep_link_up(dev, tgt_idx, comm); nfc_put_device(dev); return rc; } static int nfc_genl_dep_link_down(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx; if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; rc = nfc_dep_link_down(dev); nfc_put_device(dev); return rc; } static int nfc_genl_send_params(struct sk_buff *msg, struct nfc_llcp_local *local, u32 portid, u32 seq) { void *hdr; hdr = genlmsg_put(msg, portid, seq, &nfc_genl_family, 0, NFC_CMD_LLC_GET_PARAMS); if (!hdr) return -EMSGSIZE; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, local->dev->idx) || nla_put_u8(msg, NFC_ATTR_LLC_PARAM_LTO, local->lto) || nla_put_u8(msg, NFC_ATTR_LLC_PARAM_RW, local->rw) || nla_put_u16(msg, NFC_ATTR_LLC_PARAM_MIUX, be16_to_cpu(local->miux))) goto nla_put_failure; genlmsg_end(msg, hdr); return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } static int nfc_genl_llc_get_params(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; struct nfc_llcp_local *local; int rc = 0; struct sk_buff *msg = NULL; u32 idx; if (!info->attrs[NFC_ATTR_DEVICE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; device_lock(&dev->dev); local = nfc_llcp_find_local(dev); if (!local) { rc = -ENODEV; goto exit; } msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) { rc = -ENOMEM; goto put_local; } rc = nfc_genl_send_params(msg, local, info->snd_portid, info->snd_seq); put_local: nfc_llcp_local_put(local); exit: device_unlock(&dev->dev); nfc_put_device(dev); if (rc < 0) { if (msg) nlmsg_free(msg); return rc; } return genlmsg_reply(msg, info); } static int nfc_genl_llc_set_params(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; struct nfc_llcp_local *local; u8 rw = 0; u16 miux = 0; u32 idx; int rc = 0; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || (!info->attrs[NFC_ATTR_LLC_PARAM_LTO] && !info->attrs[NFC_ATTR_LLC_PARAM_RW] && !info->attrs[NFC_ATTR_LLC_PARAM_MIUX])) return -EINVAL; if (info->attrs[NFC_ATTR_LLC_PARAM_RW]) { rw = nla_get_u8(info->attrs[NFC_ATTR_LLC_PARAM_RW]); if (rw > LLCP_MAX_RW) return -EINVAL; } if (info->attrs[NFC_ATTR_LLC_PARAM_MIUX]) { miux = nla_get_u16(info->attrs[NFC_ATTR_LLC_PARAM_MIUX]); if (miux > LLCP_MAX_MIUX) return -EINVAL; } idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; device_lock(&dev->dev); local = nfc_llcp_find_local(dev); if (!local) { rc = -ENODEV; goto exit; } if (info->attrs[NFC_ATTR_LLC_PARAM_LTO]) { if (dev->dep_link_up) { rc = -EINPROGRESS; goto put_local; } local->lto = nla_get_u8(info->attrs[NFC_ATTR_LLC_PARAM_LTO]); } if (info->attrs[NFC_ATTR_LLC_PARAM_RW]) local->rw = rw; if (info->attrs[NFC_ATTR_LLC_PARAM_MIUX]) local->miux = cpu_to_be16(miux); put_local: nfc_llcp_local_put(local); exit: device_unlock(&dev->dev); nfc_put_device(dev); return rc; } static int nfc_genl_llc_sdreq(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; struct nfc_llcp_local *local; struct nlattr *attr, *sdp_attrs[NFC_SDP_ATTR_MAX+1]; u32 idx; u8 tid; char *uri; int rc = 0, rem; size_t uri_len, tlvs_len; struct hlist_head sdreq_list; struct nfc_llcp_sdp_tlv *sdreq; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_LLC_SDP]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; device_lock(&dev->dev); if (dev->dep_link_up == false) { rc = -ENOLINK; goto exit; } local = nfc_llcp_find_local(dev); if (!local) { rc = -ENODEV; goto exit; } INIT_HLIST_HEAD(&sdreq_list); tlvs_len = 0; nla_for_each_nested(attr, info->attrs[NFC_ATTR_LLC_SDP], rem) { rc = nla_parse_nested_deprecated(sdp_attrs, NFC_SDP_ATTR_MAX, attr, nfc_sdp_genl_policy, info->extack); if (rc != 0) { rc = -EINVAL; goto put_local; } if (!sdp_attrs[NFC_SDP_ATTR_URI]) continue; uri_len = nla_len(sdp_attrs[NFC_SDP_ATTR_URI]); if (uri_len == 0) continue; uri = nla_data(sdp_attrs[NFC_SDP_ATTR_URI]); if (*uri == 0) continue; tid = local->sdreq_next_tid++; sdreq = nfc_llcp_build_sdreq_tlv(tid, uri, uri_len); if (sdreq == NULL) { rc = -ENOMEM; goto put_local; } tlvs_len += sdreq->tlv_len; hlist_add_head(&sdreq->node, &sdreq_list); } if (hlist_empty(&sdreq_list)) { rc = -EINVAL; goto put_local; } rc = nfc_llcp_send_snl_sdreq(local, &sdreq_list, tlvs_len); put_local: nfc_llcp_local_put(local); exit: device_unlock(&dev->dev); nfc_put_device(dev); return rc; } static int nfc_genl_fw_download(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx; char firmware_name[NFC_FIRMWARE_NAME_MAXSIZE + 1]; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_FIRMWARE_NAME]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; nla_strscpy(firmware_name, info->attrs[NFC_ATTR_FIRMWARE_NAME], sizeof(firmware_name)); rc = nfc_fw_download(dev, firmware_name); nfc_put_device(dev); return rc; } int nfc_genl_fw_download_done(struct nfc_dev *dev, const char *firmware_name, u32 result) { struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); if (!msg) return -ENOMEM; hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_CMD_FW_DOWNLOAD); if (!hdr) goto free_msg; if (nla_put_string(msg, NFC_ATTR_FIRMWARE_NAME, firmware_name) || nla_put_u32(msg, NFC_ATTR_FIRMWARE_DOWNLOAD_STATUS, result) || nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_ATOMIC); return 0; nla_put_failure: free_msg: nlmsg_free(msg); return -EMSGSIZE; } static int nfc_genl_enable_se(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx, se_idx; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_SE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); se_idx = nla_get_u32(info->attrs[NFC_ATTR_SE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; rc = nfc_enable_se(dev, se_idx); nfc_put_device(dev); return rc; } static int nfc_genl_disable_se(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; int rc; u32 idx, se_idx; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_SE_INDEX]) return -EINVAL; idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); se_idx = nla_get_u32(info->attrs[NFC_ATTR_SE_INDEX]); dev = nfc_get_device(idx); if (!dev) return -ENODEV; rc = nfc_disable_se(dev, se_idx); nfc_put_device(dev); return rc; } static int nfc_genl_send_se(struct sk_buff *msg, struct nfc_dev *dev, u32 portid, u32 seq, struct netlink_callback *cb, int flags) { void *hdr; struct nfc_se *se, *n; list_for_each_entry_safe(se, n, &dev->secure_elements, list) { hdr = genlmsg_put(msg, portid, seq, &nfc_genl_family, flags, NFC_CMD_GET_SE); if (!hdr) goto nla_put_failure; if (cb) genl_dump_check_consistent(cb, hdr); if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, dev->idx) || nla_put_u32(msg, NFC_ATTR_SE_INDEX, se->idx) || nla_put_u8(msg, NFC_ATTR_SE_TYPE, se->type)) goto nla_put_failure; genlmsg_end(msg, hdr); } return 0; nla_put_failure: genlmsg_cancel(msg, hdr); return -EMSGSIZE; } static int nfc_genl_dump_ses(struct sk_buff *skb, struct netlink_callback *cb) { struct class_dev_iter *iter = (struct class_dev_iter *) cb->args[0]; struct nfc_dev *dev = (struct nfc_dev *) cb->args[1]; bool first_call = false; if (!iter) { first_call = true; iter = kmalloc(sizeof(struct class_dev_iter), GFP_KERNEL); if (!iter) return -ENOMEM; cb->args[0] = (long) iter; } mutex_lock(&nfc_devlist_mutex); cb->seq = nfc_devlist_generation; if (first_call) { nfc_device_iter_init(iter); dev = nfc_device_iter_next(iter); } while (dev) { int rc; rc = nfc_genl_send_se(skb, dev, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, cb, NLM_F_MULTI); if (rc < 0) break; dev = nfc_device_iter_next(iter); } mutex_unlock(&nfc_devlist_mutex); cb->args[1] = (long) dev; return skb->len; } static int nfc_genl_dump_ses_done(struct netlink_callback *cb) { struct class_dev_iter *iter = (struct class_dev_iter *) cb->args[0]; if (iter) { nfc_device_iter_exit(iter); kfree(iter); } return 0; } static int nfc_se_io(struct nfc_dev *dev, u32 se_idx, u8 *apdu, size_t apdu_length, se_io_cb_t cb, void *cb_context) { struct nfc_se *se; int rc; pr_debug("%s se index %d\n", dev_name(&dev->dev), se_idx); device_lock(&dev->dev); if (!device_is_registered(&dev->dev)) { rc = -ENODEV; goto error; } if (!dev->dev_up) { rc = -ENODEV; goto error; } if (!dev->ops->se_io) { rc = -EOPNOTSUPP; goto error; } se = nfc_find_se(dev, se_idx); if (!se) { rc = -EINVAL; goto error; } if (se->state != NFC_SE_ENABLED) { rc = -ENODEV; goto error; } rc = dev->ops->se_io(dev, se_idx, apdu, apdu_length, cb, cb_context); device_unlock(&dev->dev); return rc; error: device_unlock(&dev->dev); kfree(cb_context); return rc; } struct se_io_ctx { u32 dev_idx; u32 se_idx; }; static void se_io_cb(void *context, u8 *apdu, size_t apdu_len, int err) { struct se_io_ctx *ctx = context; struct sk_buff *msg; void *hdr; msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); if (!msg) { kfree(ctx); return; } hdr = genlmsg_put(msg, 0, 0, &nfc_genl_family, 0, NFC_CMD_SE_IO); if (!hdr) goto free_msg; if (nla_put_u32(msg, NFC_ATTR_DEVICE_INDEX, ctx->dev_idx) || nla_put_u32(msg, NFC_ATTR_SE_INDEX, ctx->se_idx) || nla_put(msg, NFC_ATTR_SE_APDU, apdu_len, apdu)) goto nla_put_failure; genlmsg_end(msg, hdr); genlmsg_multicast(&nfc_genl_family, msg, 0, 0, GFP_KERNEL); kfree(ctx); return; nla_put_failure: free_msg: nlmsg_free(msg); kfree(ctx); return; } static int nfc_genl_se_io(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; struct se_io_ctx *ctx; u32 dev_idx, se_idx; u8 *apdu; size_t apdu_len; int rc; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_SE_INDEX] || !info->attrs[NFC_ATTR_SE_APDU]) return -EINVAL; dev_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); se_idx = nla_get_u32(info->attrs[NFC_ATTR_SE_INDEX]); dev = nfc_get_device(dev_idx); if (!dev) return -ENODEV; if (!dev->ops || !dev->ops->se_io) { rc = -EOPNOTSUPP; goto put_dev; } apdu_len = nla_len(info->attrs[NFC_ATTR_SE_APDU]); if (apdu_len == 0) { rc = -EINVAL; goto put_dev; } apdu = nla_data(info->attrs[NFC_ATTR_SE_APDU]); ctx = kzalloc(sizeof(struct se_io_ctx), GFP_KERNEL); if (!ctx) { rc = -ENOMEM; goto put_dev; } ctx->dev_idx = dev_idx; ctx->se_idx = se_idx; rc = nfc_se_io(dev, se_idx, apdu, apdu_len, se_io_cb, ctx); put_dev: nfc_put_device(dev); return rc; } static int nfc_genl_vendor_cmd(struct sk_buff *skb, struct genl_info *info) { struct nfc_dev *dev; const struct nfc_vendor_cmd *cmd; u32 dev_idx, vid, subcmd; u8 *data; size_t data_len; int i, err; if (!info->attrs[NFC_ATTR_DEVICE_INDEX] || !info->attrs[NFC_ATTR_VENDOR_ID] || !info->attrs[NFC_ATTR_VENDOR_SUBCMD]) return -EINVAL; dev_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]); vid = nla_get_u32(info->attrs[NFC_ATTR_VENDOR_ID]); subcmd = nla_get_u32(info->attrs[NFC_ATTR_VENDOR_SUBCMD]); dev = nfc_get_device(dev_idx); if (!dev) return -ENODEV; if (!dev->vendor_cmds || !dev->n_vendor_cmds) { err = -ENODEV; goto put_dev; } if (info->attrs[NFC_ATTR_VENDOR_DATA]) { data = nla_data(info->attrs[NFC_ATTR_VENDOR_DATA]); data_len = nla_len(info->attrs[NFC_ATTR_VENDOR_DATA]); if (data_len == 0) { err = -EINVAL; goto put_dev; } } else { data = NULL; data_len = 0; } for (i = 0; i < dev->n_vendor_cmds; i++) { cmd = &dev->vendor_cmds[i]; if (cmd->vendor_id != vid || cmd->subcmd != subcmd) continue; dev->cur_cmd_info = info; err = cmd->doit(dev, data, data_len); dev->cur_cmd_info = NULL; goto put_dev; } err = -EOPNOTSUPP; put_dev: nfc_put_device(dev); return err; } /* message building helper */ static inline void *nfc_hdr_put(struct sk_buff *skb, u32 portid, u32 seq, int flags, u8 cmd) { /* since there is no private header just add the generic one */ return genlmsg_put(skb, portid, seq, &nfc_genl_family, flags, cmd); } static struct sk_buff * __nfc_alloc_vendor_cmd_skb(struct nfc_dev *dev, int approxlen, u32 portid, u32 seq, enum nfc_attrs attr, u32 oui, u32 subcmd, gfp_t gfp) { struct sk_buff *skb; void *hdr; skb = nlmsg_new(approxlen + 100, gfp); if (!skb) return NULL; hdr = nfc_hdr_put(skb, portid, seq, 0, NFC_CMD_VENDOR); if (!hdr) { kfree_skb(skb); return NULL; } if (nla_put_u32(skb, NFC_ATTR_DEVICE_INDEX, dev->idx)) goto nla_put_failure; if (nla_put_u32(skb, NFC_ATTR_VENDOR_ID, oui)) goto nla_put_failure; if (nla_put_u32(skb, NFC_ATTR_VENDOR_SUBCMD, subcmd)) goto nla_put_failure; ((void **)skb->cb)[0] = dev; ((void **)skb->cb)[1] = hdr; return skb; nla_put_failure: kfree_skb(skb); return NULL; } struct sk_buff *__nfc_alloc_vendor_cmd_reply_skb(struct nfc_dev *dev, enum nfc_attrs attr, u32 oui, u32 subcmd, int approxlen) { if (WARN_ON(!dev->cur_cmd_info)) return NULL; return __nfc_alloc_vendor_cmd_skb(dev, approxlen, dev->cur_cmd_info->snd_portid, dev->cur_cmd_info->snd_seq, attr, oui, subcmd, GFP_KERNEL); } EXPORT_SYMBOL(__nfc_alloc_vendor_cmd_reply_skb); int nfc_vendor_cmd_reply(struct sk_buff *skb) { struct nfc_dev *dev = ((void **)skb->cb)[0]; void *hdr = ((void **)skb->cb)[1]; /* clear CB data for netlink core to own from now on */ memset(skb->cb, 0, sizeof(skb->cb)); if (WARN_ON(!dev->cur_cmd_info)) { kfree_skb(skb); return -EINVAL; } genlmsg_end(skb, hdr); return genlmsg_reply(skb, dev->cur_cmd_info); } EXPORT_SYMBOL(nfc_vendor_cmd_reply); static const struct genl_ops nfc_genl_ops[] = { { .cmd = NFC_CMD_GET_DEVICE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_get_device, .dumpit = nfc_genl_dump_devices, .done = nfc_genl_dump_devices_done, }, { .cmd = NFC_CMD_DEV_UP, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_dev_up, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_DEV_DOWN, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_dev_down, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_START_POLL, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_start_poll, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_STOP_POLL, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_stop_poll, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_DEP_LINK_UP, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_dep_link_up, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_DEP_LINK_DOWN, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_dep_link_down, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_GET_TARGET, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP_STRICT, .dumpit = nfc_genl_dump_targets, .done = nfc_genl_dump_targets_done, }, { .cmd = NFC_CMD_LLC_GET_PARAMS, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_llc_get_params, }, { .cmd = NFC_CMD_LLC_SET_PARAMS, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_llc_set_params, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_LLC_SDREQ, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_llc_sdreq, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_FW_DOWNLOAD, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_fw_download, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_ENABLE_SE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_enable_se, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_DISABLE_SE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_disable_se, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_GET_SE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .dumpit = nfc_genl_dump_ses, .done = nfc_genl_dump_ses_done, }, { .cmd = NFC_CMD_SE_IO, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_se_io, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_ACTIVATE_TARGET, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_activate_target, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_VENDOR, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_vendor_cmd, .flags = GENL_ADMIN_PERM, }, { .cmd = NFC_CMD_DEACTIVATE_TARGET, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = nfc_genl_deactivate_target, .flags = GENL_ADMIN_PERM, }, }; static struct genl_family nfc_genl_family __ro_after_init = { .hdrsize = 0, .name = NFC_GENL_NAME, .version = NFC_GENL_VERSION, .maxattr = NFC_ATTR_MAX, .policy = nfc_genl_policy, .module = THIS_MODULE, .ops = nfc_genl_ops, .n_ops = ARRAY_SIZE(nfc_genl_ops), .resv_start_op = NFC_CMD_DEACTIVATE_TARGET + 1, .mcgrps = nfc_genl_mcgrps, .n_mcgrps = ARRAY_SIZE(nfc_genl_mcgrps), }; struct urelease_work { struct work_struct w; u32 portid; }; static void nfc_urelease_event_work(struct work_struct *work) { struct urelease_work *w = container_of(work, struct urelease_work, w); struct class_dev_iter iter; struct nfc_dev *dev; pr_debug("portid %d\n", w->portid); mutex_lock(&nfc_devlist_mutex); nfc_device_iter_init(&iter); dev = nfc_device_iter_next(&iter); while (dev) { mutex_lock(&dev->genl_data.genl_data_mutex); if (dev->genl_data.poll_req_portid == w->portid) { nfc_stop_poll(dev); dev->genl_data.poll_req_portid = 0; } mutex_unlock(&dev->genl_data.genl_data_mutex); dev = nfc_device_iter_next(&iter); } nfc_device_iter_exit(&iter); mutex_unlock(&nfc_devlist_mutex); kfree(w); } static int nfc_genl_rcv_nl_event(struct notifier_block *this, unsigned long event, void *ptr) { struct netlink_notify *n = ptr; struct urelease_work *w; if (event != NETLINK_URELEASE || n->protocol != NETLINK_GENERIC) goto out; pr_debug("NETLINK_URELEASE event from id %d\n", n->portid); w = kmalloc(sizeof(*w), GFP_ATOMIC); if (w) { INIT_WORK(&w->w, nfc_urelease_event_work); w->portid = n->portid; schedule_work(&w->w); } out: return NOTIFY_DONE; } void nfc_genl_data_init(struct nfc_genl_data *genl_data) { genl_data->poll_req_portid = 0; mutex_init(&genl_data->genl_data_mutex); } void nfc_genl_data_exit(struct nfc_genl_data *genl_data) { mutex_destroy(&genl_data->genl_data_mutex); } static struct notifier_block nl_notifier = { .notifier_call = nfc_genl_rcv_nl_event, }; /** * nfc_genl_init() - Initialize netlink interface * * This initialization function registers the nfc netlink family. */ int __init nfc_genl_init(void) { int rc; rc = genl_register_family(&nfc_genl_family); if (rc) return rc; netlink_register_notifier(&nl_notifier); return 0; } /** * nfc_genl_exit() - Deinitialize netlink interface * * This exit function unregisters the nfc netlink family. */ void nfc_genl_exit(void) { netlink_unregister_notifier(&nl_notifier); genl_unregister_family(&nfc_genl_family); } |
| 284 41 1 18 7 12 43 97 6 1 93 2328 3 4 1 1 1 1 1 1 2 55 62 59 61 59 83 113 1946 38 192 193 192 2 23 23 192 148 80 265 61 62 62 61 61 1 3 22 41 61 61 60 29 39 61 4 1 35 62 62 51 31 31 31 62 62 9 9 8 61 61 117 117 117 90 49 18 3 349 62 60 2 2 38 38 38 1 105 271 8 5 93 4 2 1 8 2 4 2 4 2174 2173 4 32 4 18 1308 1968 1504 1802 1 3 1 1615 300 2145 18 2173 260 56 207 2 11 257 11 2 18 17 18 18 18 42 4 157 160 57 236 37 236 235 9 259 265 266 265 237 37 237 37 266 262 8 264 237 36 265 265 132 145 266 265 152 127 127 127 9 118 84 1 15 15 57 84 13 13 13 3 15 10 90 89 88 33 1 52 80 80 24 2 2 2 18 14 24 24 88 35 9 1815 1817 90 90 39 55 43 57 66 121 99 90 33 93 13 92 16 7 92 92 87 87 27 27 52 39 38 2 10 38 1 38 1 39 1 38 43 18 104 105 1 104 72 35 3 32 335 13 331 352 7 24 3 6 338 7 278 3 58 333 91 266 43 8 37 43 333 2 2 266 1 4 1 84 332 43 349 358 357 5 3 2 2 198 70 10 121 110 44 44 4 139 88 72 88 76 65 1 10 72 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 | // SPDX-License-Identifier: GPL-2.0-or-later /* * fs/eventpoll.c (Efficient event retrieval implementation) * Copyright (C) 2001,...,2009 Davide Libenzi * * Davide Libenzi <davidel@xmailserver.org> */ #include <linux/init.h> #include <linux/kernel.h> #include <linux/sched/signal.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/signal.h> #include <linux/errno.h> #include <linux/mm.h> #include <linux/slab.h> #include <linux/poll.h> #include <linux/string.h> #include <linux/list.h> #include <linux/hash.h> #include <linux/spinlock.h> #include <linux/syscalls.h> #include <linux/rbtree.h> #include <linux/wait.h> #include <linux/eventpoll.h> #include <linux/mount.h> #include <linux/bitops.h> #include <linux/mutex.h> #include <linux/anon_inodes.h> #include <linux/device.h> #include <linux/uaccess.h> #include <asm/io.h> #include <asm/mman.h> #include <linux/atomic.h> #include <linux/proc_fs.h> #include <linux/seq_file.h> #include <linux/compat.h> #include <linux/rculist.h> #include <linux/capability.h> #include <net/busy_poll.h> /* * LOCKING: * There are three level of locking required by epoll : * * 1) epnested_mutex (mutex) * 2) ep->mtx (mutex) * 3) ep->lock (rwlock) * * The acquire order is the one listed above, from 1 to 3. * We need a rwlock (ep->lock) because we manipulate objects * from inside the poll callback, that might be triggered from * a wake_up() that in turn might be called from IRQ context. * So we can't sleep inside the poll callback and hence we need * a spinlock. During the event transfer loop (from kernel to * user space) we could end up sleeping due a copy_to_user(), so * we need a lock that will allow us to sleep. This lock is a * mutex (ep->mtx). It is acquired during the event transfer loop, * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file(). * The epnested_mutex is acquired when inserting an epoll fd onto another * epoll fd. We do this so that we walk the epoll tree and ensure that this * insertion does not create a cycle of epoll file descriptors, which * could lead to deadlock. We need a global mutex to prevent two * simultaneous inserts (A into B and B into A) from racing and * constructing a cycle without either insert observing that it is * going to. * It is necessary to acquire multiple "ep->mtx"es at once in the * case when one epoll fd is added to another. In this case, we * always acquire the locks in the order of nesting (i.e. after * epoll_ctl(e1, EPOLL_CTL_ADD, e2), e1->mtx will always be acquired * before e2->mtx). Since we disallow cycles of epoll file * descriptors, this ensures that the mutexes are well-ordered. In * order to communicate this nesting to lockdep, when walking a tree * of epoll file descriptors, we use the current recursion depth as * the lockdep subkey. * It is possible to drop the "ep->mtx" and to use the global * mutex "epnested_mutex" (together with "ep->lock") to have it working, * but having "ep->mtx" will make the interface more scalable. * Events that require holding "epnested_mutex" are very rare, while for * normal operations the epoll private "ep->mtx" will guarantee * a better scalability. */ /* Epoll private bits inside the event mask */ #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE) #define EPOLLINOUT_BITS (EPOLLIN | EPOLLOUT) #define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | EPOLLERR | EPOLLHUP | \ EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE) /* Maximum number of nesting allowed inside epoll sets */ #define EP_MAX_NESTS 4 #define EP_MAX_EVENTS (INT_MAX / sizeof(struct epoll_event)) #define EP_UNACTIVE_PTR ((void *) -1L) #define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry)) struct epoll_filefd { struct file *file; int fd; } __packed; /* Wait structure used by the poll hooks */ struct eppoll_entry { /* List header used to link this structure to the "struct epitem" */ struct eppoll_entry *next; /* The "base" pointer is set to the container "struct epitem" */ struct epitem *base; /* * Wait queue item that will be linked to the target file wait * queue head. */ wait_queue_entry_t wait; /* The wait queue head that linked the "wait" wait queue item */ wait_queue_head_t *whead; }; /* * Each file descriptor added to the eventpoll interface will * have an entry of this type linked to the "rbr" RB tree. * Avoid increasing the size of this struct, there can be many thousands * of these on a server and we do not want this to take another cache line. */ struct epitem { union { /* RB tree node links this structure to the eventpoll RB tree */ struct rb_node rbn; /* Used to free the struct epitem */ struct rcu_head rcu; }; /* List header used to link this structure to the eventpoll ready list */ struct list_head rdllink; /* * Works together "struct eventpoll"->ovflist in keeping the * single linked chain of items. */ struct epitem *next; /* The file descriptor information this item refers to */ struct epoll_filefd ffd; /* * Protected by file->f_lock, true for to-be-released epitem already * removed from the "struct file" items list; together with * eventpoll->refcount orchestrates "struct eventpoll" disposal */ bool dying; /* List containing poll wait queues */ struct eppoll_entry *pwqlist; /* The "container" of this item */ struct eventpoll *ep; /* List header used to link this item to the "struct file" items list */ struct hlist_node fllink; /* wakeup_source used when EPOLLWAKEUP is set */ struct wakeup_source __rcu *ws; /* The structure that describe the interested events and the source fd */ struct epoll_event event; }; /* * This structure is stored inside the "private_data" member of the file * structure and represents the main data structure for the eventpoll * interface. */ struct eventpoll { /* * This mutex is used to ensure that files are not removed * while epoll is using them. This is held during the event * collection loop, the file cleanup path, the epoll file exit * code and the ctl operations. */ struct mutex mtx; /* Wait queue used by sys_epoll_wait() */ wait_queue_head_t wq; /* Wait queue used by file->poll() */ wait_queue_head_t poll_wait; /* List of ready file descriptors */ struct list_head rdllist; /* Lock which protects rdllist and ovflist */ rwlock_t lock; /* RB tree root used to store monitored fd structs */ struct rb_root_cached rbr; /* * This is a single linked list that chains all the "struct epitem" that * happened while transferring ready events to userspace w/out * holding ->lock. */ struct epitem *ovflist; /* wakeup_source used when ep_send_events or __ep_eventpoll_poll is running */ struct wakeup_source *ws; /* The user that created the eventpoll descriptor */ struct user_struct *user; struct file *file; /* used to optimize loop detection check */ u64 gen; struct hlist_head refs; u8 loop_check_depth; /* * usage count, used together with epitem->dying to * orchestrate the disposal of this struct */ refcount_t refcount; #ifdef CONFIG_NET_RX_BUSY_POLL /* used to track busy poll napi_id */ unsigned int napi_id; /* busy poll timeout */ u32 busy_poll_usecs; /* busy poll packet budget */ u16 busy_poll_budget; bool prefer_busy_poll; #endif #ifdef CONFIG_DEBUG_LOCK_ALLOC /* tracks wakeup nests for lockdep validation */ u8 nests; #endif }; /* Wrapper struct used by poll queueing */ struct ep_pqueue { poll_table pt; struct epitem *epi; }; /* * Configuration options available inside /proc/sys/fs/epoll/ */ /* Maximum number of epoll watched descriptors, per user */ static long max_user_watches __read_mostly; /* Used for cycles detection */ static DEFINE_MUTEX(epnested_mutex); static u64 loop_check_gen = 0; /* Used to check for epoll file descriptor inclusion loops */ static struct eventpoll *inserting_into; /* Slab cache used to allocate "struct epitem" */ static struct kmem_cache *epi_cache __ro_after_init; /* Slab cache used to allocate "struct eppoll_entry" */ static struct kmem_cache *pwq_cache __ro_after_init; /* * List of files with newly added links, where we may need to limit the number * of emanating paths. Protected by the epnested_mutex. */ struct epitems_head { struct hlist_head epitems; struct epitems_head *next; }; static struct epitems_head *tfile_check_list = EP_UNACTIVE_PTR; static struct kmem_cache *ephead_cache __ro_after_init; static inline void free_ephead(struct epitems_head *head) { if (head) kmem_cache_free(ephead_cache, head); } static void list_file(struct file *file) { struct epitems_head *head; head = container_of(file->f_ep, struct epitems_head, epitems); if (!head->next) { head->next = tfile_check_list; tfile_check_list = head; } } static void unlist_file(struct epitems_head *head) { struct epitems_head *to_free = head; struct hlist_node *p = rcu_dereference(hlist_first_rcu(&head->epitems)); if (p) { struct epitem *epi= container_of(p, struct epitem, fllink); spin_lock(&epi->ffd.file->f_lock); if (!hlist_empty(&head->epitems)) to_free = NULL; head->next = NULL; spin_unlock(&epi->ffd.file->f_lock); } free_ephead(to_free); } #ifdef CONFIG_SYSCTL #include <linux/sysctl.h> static long long_zero; static long long_max = LONG_MAX; static const struct ctl_table epoll_table[] = { { .procname = "max_user_watches", .data = &max_user_watches, .maxlen = sizeof(max_user_watches), .mode = 0644, .proc_handler = proc_doulongvec_minmax, .extra1 = &long_zero, .extra2 = &long_max, }, }; static void __init epoll_sysctls_init(void) { register_sysctl("fs/epoll", epoll_table); } #else #define epoll_sysctls_init() do { } while (0) #endif /* CONFIG_SYSCTL */ static const struct file_operations eventpoll_fops; static inline int is_file_epoll(struct file *f) { return f->f_op == &eventpoll_fops; } /* Setup the structure that is used as key for the RB tree */ static inline void ep_set_ffd(struct epoll_filefd *ffd, struct file *file, int fd) { ffd->file = file; ffd->fd = fd; } /* Compare RB tree keys */ static inline int ep_cmp_ffd(struct epoll_filefd *p1, struct epoll_filefd *p2) { return (p1->file > p2->file ? +1: (p1->file < p2->file ? -1 : p1->fd - p2->fd)); } /* Tells us if the item is currently linked */ static inline int ep_is_linked(struct epitem *epi) { return !list_empty(&epi->rdllink); } static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_entry_t *p) { return container_of(p, struct eppoll_entry, wait); } /* Get the "struct epitem" from a wait queue pointer */ static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p) { return container_of(p, struct eppoll_entry, wait)->base; } /** * ep_events_available - Checks if ready events might be available. * * @ep: Pointer to the eventpoll context. * * Return: a value different than %zero if ready events are available, * or %zero otherwise. */ static inline int ep_events_available(struct eventpoll *ep) { return !list_empty_careful(&ep->rdllist) || READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR; } #ifdef CONFIG_NET_RX_BUSY_POLL /** * busy_loop_ep_timeout - check if busy poll has timed out. The timeout value * from the epoll instance ep is preferred, but if it is not set fallback to * the system-wide global via busy_loop_timeout. * * @start_time: The start time used to compute the remaining time until timeout. * @ep: Pointer to the eventpoll context. * * Return: true if the timeout has expired, false otherwise. */ static bool busy_loop_ep_timeout(unsigned long start_time, struct eventpoll *ep) { unsigned long bp_usec = READ_ONCE(ep->busy_poll_usecs); if (bp_usec) { unsigned long end_time = start_time + bp_usec; unsigned long now = busy_loop_current_time(); return time_after(now, end_time); } else { return busy_loop_timeout(start_time); } } static bool ep_busy_loop_on(struct eventpoll *ep) { return !!READ_ONCE(ep->busy_poll_usecs) || READ_ONCE(ep->prefer_busy_poll) || net_busy_loop_on(); } static bool ep_busy_loop_end(void *p, unsigned long start_time) { struct eventpoll *ep = p; return ep_events_available(ep) || busy_loop_ep_timeout(start_time, ep); } /* * Busy poll if globally on and supporting sockets found && no events, * busy loop will return if need_resched or ep_events_available. * * we must do our busy polling with irqs enabled */ static bool ep_busy_loop(struct eventpoll *ep) { unsigned int napi_id = READ_ONCE(ep->napi_id); u16 budget = READ_ONCE(ep->busy_poll_budget); bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll); if (!budget) budget = BUSY_POLL_BUDGET; if (napi_id_valid(napi_id) && ep_busy_loop_on(ep)) { napi_busy_loop(napi_id, ep_busy_loop_end, ep, prefer_busy_poll, budget); if (ep_events_available(ep)) return true; /* * Busy poll timed out. Drop NAPI ID for now, we can add * it back in when we have moved a socket with a valid NAPI * ID onto the ready list. */ if (prefer_busy_poll) napi_resume_irqs(napi_id); ep->napi_id = 0; return false; } return false; } /* * Set epoll busy poll NAPI ID from sk. */ static inline void ep_set_busy_poll_napi_id(struct epitem *epi) { struct eventpoll *ep = epi->ep; unsigned int napi_id; struct socket *sock; struct sock *sk; if (!ep_busy_loop_on(ep)) return; sock = sock_from_file(epi->ffd.file); if (!sock) return; sk = sock->sk; if (!sk) return; napi_id = READ_ONCE(sk->sk_napi_id); /* Non-NAPI IDs can be rejected * or * Nothing to do if we already have this ID */ if (!napi_id_valid(napi_id) || napi_id == ep->napi_id) return; /* record NAPI ID for use in next busy poll */ ep->napi_id = napi_id; } static long ep_eventpoll_bp_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct eventpoll *ep = file->private_data; void __user *uarg = (void __user *)arg; struct epoll_params epoll_params; switch (cmd) { case EPIOCSPARAMS: if (copy_from_user(&epoll_params, uarg, sizeof(epoll_params))) return -EFAULT; /* pad byte must be zero */ if (epoll_params.__pad) return -EINVAL; if (epoll_params.busy_poll_usecs > S32_MAX) return -EINVAL; if (epoll_params.prefer_busy_poll > 1) return -EINVAL; if (epoll_params.busy_poll_budget > NAPI_POLL_WEIGHT && !capable(CAP_NET_ADMIN)) return -EPERM; WRITE_ONCE(ep->busy_poll_usecs, epoll_params.busy_poll_usecs); WRITE_ONCE(ep->busy_poll_budget, epoll_params.busy_poll_budget); WRITE_ONCE(ep->prefer_busy_poll, epoll_params.prefer_busy_poll); return 0; case EPIOCGPARAMS: memset(&epoll_params, 0, sizeof(epoll_params)); epoll_params.busy_poll_usecs = READ_ONCE(ep->busy_poll_usecs); epoll_params.busy_poll_budget = READ_ONCE(ep->busy_poll_budget); epoll_params.prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll); if (copy_to_user(uarg, &epoll_params, sizeof(epoll_params))) return -EFAULT; return 0; default: return -ENOIOCTLCMD; } } static void ep_suspend_napi_irqs(struct eventpoll *ep) { unsigned int napi_id = READ_ONCE(ep->napi_id); if (napi_id_valid(napi_id) && READ_ONCE(ep->prefer_busy_poll)) napi_suspend_irqs(napi_id); } static void ep_resume_napi_irqs(struct eventpoll *ep) { unsigned int napi_id = READ_ONCE(ep->napi_id); if (napi_id_valid(napi_id) && READ_ONCE(ep->prefer_busy_poll)) napi_resume_irqs(napi_id); } #else static inline bool ep_busy_loop(struct eventpoll *ep) { return false; } static inline void ep_set_busy_poll_napi_id(struct epitem *epi) { } static long ep_eventpoll_bp_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { return -EOPNOTSUPP; } static void ep_suspend_napi_irqs(struct eventpoll *ep) { } static void ep_resume_napi_irqs(struct eventpoll *ep) { } #endif /* CONFIG_NET_RX_BUSY_POLL */ /* * As described in commit 0ccf831cb lockdep: annotate epoll * the use of wait queues used by epoll is done in a very controlled * manner. Wake ups can nest inside each other, but are never done * with the same locking. For example: * * dfd = socket(...); * efd1 = epoll_create(); * efd2 = epoll_create(); * epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...); * epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...); * * When a packet arrives to the device underneath "dfd", the net code will * issue a wake_up() on its poll wake list. Epoll (efd1) has installed a * callback wakeup entry on that queue, and the wake_up() performed by the * "dfd" net code will end up in ep_poll_callback(). At this point epoll * (efd1) notices that it may have some event ready, so it needs to wake up * the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() * that ends up in another wake_up(), after having checked about the * recursion constraints. That are, no more than EP_MAX_NESTS, to avoid * stack blasting. * * When CONFIG_DEBUG_LOCK_ALLOC is enabled, make sure lockdep can handle * this special case of epoll. */ #ifdef CONFIG_DEBUG_LOCK_ALLOC static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi, unsigned pollflags) { struct eventpoll *ep_src; unsigned long flags; u8 nests = 0; /* * To set the subclass or nesting level for spin_lock_irqsave_nested() * it might be natural to create a per-cpu nest count. However, since * we can recurse on ep->poll_wait.lock, and a non-raw spinlock can * schedule() in the -rt kernel, the per-cpu variable are no longer * protected. Thus, we are introducing a per eventpoll nest field. * If we are not being call from ep_poll_callback(), epi is NULL and * we are at the first level of nesting, 0. Otherwise, we are being * called from ep_poll_callback() and if a previous wakeup source is * not an epoll file itself, we are at depth 1 since the wakeup source * is depth 0. If the wakeup source is a previous epoll file in the * wakeup chain then we use its nests value and record ours as * nests + 1. The previous epoll file nests value is stable since its * already holding its own poll_wait.lock. */ if (epi) { if ((is_file_epoll(epi->ffd.file))) { ep_src = epi->ffd.file->private_data; nests = ep_src->nests; } else { nests = 1; } } spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests); ep->nests = nests + 1; wake_up_locked_poll(&ep->poll_wait, EPOLLIN | pollflags); ep->nests = 0; spin_unlock_irqrestore(&ep->poll_wait.lock, flags); } #else static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi, __poll_t pollflags) { wake_up_poll(&ep->poll_wait, EPOLLIN | pollflags); } #endif static void ep_remove_wait_queue(struct eppoll_entry *pwq) { wait_queue_head_t *whead; rcu_read_lock(); /* * If it is cleared by POLLFREE, it should be rcu-safe. * If we read NULL we need a barrier paired with * smp_store_release() in ep_poll_callback(), otherwise * we rely on whead->lock. */ whead = smp_load_acquire(&pwq->whead); if (whead) remove_wait_queue(whead, &pwq->wait); rcu_read_unlock(); } /* * This function unregisters poll callbacks from the associated file * descriptor. Must be called with "mtx" held. */ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi) { struct eppoll_entry **p = &epi->pwqlist; struct eppoll_entry *pwq; while ((pwq = *p) != NULL) { *p = pwq->next; ep_remove_wait_queue(pwq); kmem_cache_free(pwq_cache, pwq); } } /* call only when ep->mtx is held */ static inline struct wakeup_source *ep_wakeup_source(struct epitem *epi) { return rcu_dereference_check(epi->ws, lockdep_is_held(&epi->ep->mtx)); } /* call only when ep->mtx is held */ static inline void ep_pm_stay_awake(struct epitem *epi) { struct wakeup_source *ws = ep_wakeup_source(epi); if (ws) __pm_stay_awake(ws); } static inline bool ep_has_wakeup_source(struct epitem *epi) { return rcu_access_pointer(epi->ws) ? true : false; } /* call when ep->mtx cannot be held (ep_poll_callback) */ static inline void ep_pm_stay_awake_rcu(struct epitem *epi) { struct wakeup_source *ws; rcu_read_lock(); ws = rcu_dereference(epi->ws); if (ws) __pm_stay_awake(ws); rcu_read_unlock(); } /* * ep->mutex needs to be held because we could be hit by * eventpoll_release_file() and epoll_ctl(). */ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist) { /* * Steal the ready list, and re-init the original one to the * empty list. Also, set ep->ovflist to NULL so that events * happening while looping w/out locks, are not lost. We cannot * have the poll callback to queue directly on ep->rdllist, * because we want the "sproc" callback to be able to do it * in a lockless way. */ lockdep_assert_irqs_enabled(); write_lock_irq(&ep->lock); list_splice_init(&ep->rdllist, txlist); WRITE_ONCE(ep->ovflist, NULL); write_unlock_irq(&ep->lock); } static void ep_done_scan(struct eventpoll *ep, struct list_head *txlist) { struct epitem *epi, *nepi; write_lock_irq(&ep->lock); /* * During the time we spent inside the "sproc" callback, some * other events might have been queued by the poll callback. * We re-insert them inside the main ready-list here. */ for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL; nepi = epi->next, epi->next = EP_UNACTIVE_PTR) { /* * We need to check if the item is already in the list. * During the "sproc" callback execution time, items are * queued into ->ovflist but the "txlist" might already * contain them, and the list_splice() below takes care of them. */ if (!ep_is_linked(epi)) { /* * ->ovflist is LIFO, so we have to reverse it in order * to keep in FIFO. */ list_add(&epi->rdllink, &ep->rdllist); ep_pm_stay_awake(epi); } } /* * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after * releasing the lock, events will be queued in the normal way inside * ep->rdllist. */ WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR); /* * Quickly re-inject items left on "txlist". */ list_splice(txlist, &ep->rdllist); __pm_relax(ep->ws); if (!list_empty(&ep->rdllist)) { if (waitqueue_active(&ep->wq)) wake_up(&ep->wq); } write_unlock_irq(&ep->lock); } static void ep_get(struct eventpoll *ep) { refcount_inc(&ep->refcount); } /* * Returns true if the event poll can be disposed */ static bool ep_refcount_dec_and_test(struct eventpoll *ep) { if (!refcount_dec_and_test(&ep->refcount)) return false; WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root)); return true; } static void ep_free(struct eventpoll *ep) { ep_resume_napi_irqs(ep); mutex_destroy(&ep->mtx); free_uid(ep->user); wakeup_source_unregister(ep->ws); kfree(ep); } /* * Removes a "struct epitem" from the eventpoll RB tree and deallocates * all the associated resources. Must be called with "mtx" held. * If the dying flag is set, do the removal only if force is true. * This prevents ep_clear_and_put() from dropping all the ep references * while running concurrently with eventpoll_release_file(). * Returns true if the eventpoll can be disposed. */ static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force) { struct file *file = epi->ffd.file; struct epitems_head *to_free; struct hlist_head *head; lockdep_assert_irqs_enabled(); /* * Removes poll wait queue hooks. */ ep_unregister_pollwait(ep, epi); /* Remove the current item from the list of epoll hooks */ spin_lock(&file->f_lock); if (epi->dying && !force) { spin_unlock(&file->f_lock); return false; } to_free = NULL; head = file->f_ep; if (head->first == &epi->fllink && !epi->fllink.next) { /* See eventpoll_release() for details. */ WRITE_ONCE(file->f_ep, NULL); if (!is_file_epoll(file)) { struct epitems_head *v; v = container_of(head, struct epitems_head, epitems); if (!smp_load_acquire(&v->next)) to_free = v; } } hlist_del_rcu(&epi->fllink); spin_unlock(&file->f_lock); free_ephead(to_free); rb_erase_cached(&epi->rbn, &ep->rbr); write_lock_irq(&ep->lock); if (ep_is_linked(epi)) list_del_init(&epi->rdllink); write_unlock_irq(&ep->lock); wakeup_source_unregister(ep_wakeup_source(epi)); /* * At this point it is safe to free the eventpoll item. Use the union * field epi->rcu, since we are trying to minimize the size of * 'struct epitem'. The 'rbn' field is no longer in use. Protected by * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make * use of the rbn field. */ kfree_rcu(epi, rcu); percpu_counter_dec(&ep->user->epoll_watches); return true; } /* * ep_remove variant for callers owing an additional reference to the ep */ static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi) { if (__ep_remove(ep, epi, false)) WARN_ON_ONCE(ep_refcount_dec_and_test(ep)); } static void ep_clear_and_put(struct eventpoll *ep) { struct rb_node *rbp, *next; struct epitem *epi; /* We need to release all tasks waiting for these file */ if (waitqueue_active(&ep->poll_wait)) ep_poll_safewake(ep, NULL, 0); mutex_lock(&ep->mtx); /* * Walks through the whole tree by unregistering poll callbacks. */ for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) { epi = rb_entry(rbp, struct epitem, rbn); ep_unregister_pollwait(ep, epi); cond_resched(); } /* * Walks through the whole tree and try to free each "struct epitem". * Note that ep_remove_safe() will not remove the epitem in case of a * racing eventpoll_release_file(); the latter will do the removal. * At this point we are sure no poll callbacks will be lingering around. * Since we still own a reference to the eventpoll struct, the loop can't * dispose it. */ for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = next) { next = rb_next(rbp); epi = rb_entry(rbp, struct epitem, rbn); ep_remove_safe(ep, epi); cond_resched(); } mutex_unlock(&ep->mtx); if (ep_refcount_dec_and_test(ep)) ep_free(ep); } static long ep_eventpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { int ret; if (!is_file_epoll(file)) return -EINVAL; switch (cmd) { case EPIOCSPARAMS: case EPIOCGPARAMS: ret = ep_eventpoll_bp_ioctl(file, cmd, arg); break; default: ret = -EINVAL; break; } return ret; } static int ep_eventpoll_release(struct inode *inode, struct file *file) { struct eventpoll *ep = file->private_data; if (ep) ep_clear_and_put(ep); return 0; } static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth); static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int depth) { struct eventpoll *ep = file->private_data; LIST_HEAD(txlist); struct epitem *epi, *tmp; poll_table pt; __poll_t res = 0; init_poll_funcptr(&pt, NULL); /* Insert inside our poll wait queue */ poll_wait(file, &ep->poll_wait, wait); /* * Proceed to find out if wanted events are really available inside * the ready list. */ mutex_lock_nested(&ep->mtx, depth); ep_start_scan(ep, &txlist); list_for_each_entry_safe(epi, tmp, &txlist, rdllink) { if (ep_item_poll(epi, &pt, depth + 1)) { res = EPOLLIN | EPOLLRDNORM; break; } else { /* * Item has been dropped into the ready list by the poll * callback, but it's not actually ready, as far as * caller requested events goes. We can remove it here. */ __pm_relax(ep_wakeup_source(epi)); list_del_init(&epi->rdllink); } } ep_done_scan(ep, &txlist); mutex_unlock(&ep->mtx); return res; } /* * The ffd.file pointer may be in the process of being torn down due to * being closed, but we may not have finished eventpoll_release() yet. * * Normally, even with the atomic_long_inc_not_zero, the file may have * been free'd and then gotten re-allocated to something else (since * files are not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU). * * But for epoll, users hold the ep->mtx mutex, and as such any file in * the process of being free'd will block in eventpoll_release_file() * and thus the underlying file allocation will not be free'd, and the * file re-use cannot happen. * * For the same reason we can avoid a rcu_read_lock() around the * operation - 'ffd.file' cannot go away even if the refcount has * reached zero (but we must still not call out to ->poll() functions * etc). */ static struct file *epi_fget(const struct epitem *epi) { struct file *file; file = epi->ffd.file; if (!file_ref_get(&file->f_ref)) file = NULL; return file; } /* * Differs from ep_eventpoll_poll() in that internal callers already have * the ep->mtx so we need to start from depth=1, such that mutex_lock_nested() * is correctly annotated. */ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { struct file *file = epi_fget(epi); __poll_t res; /* * We could return EPOLLERR | EPOLLHUP or something, but let's * treat this more as "file doesn't exist, poll didn't happen". */ if (!file) return 0; pt->_key = epi->event.events; if (!is_file_epoll(file)) res = vfs_poll(file, pt); else res = __ep_eventpoll_poll(file, pt, depth); fput(file); return res & epi->event.events; } static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait) { return __ep_eventpoll_poll(file, wait, 0); } #ifdef CONFIG_PROC_FS static void ep_show_fdinfo(struct seq_file *m, struct file *f) { struct eventpoll *ep = f->private_data; struct rb_node *rbp; mutex_lock(&ep->mtx); for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) { struct epitem *epi = rb_entry(rbp, struct epitem, rbn); struct inode *inode = file_inode(epi->ffd.file); seq_printf(m, "tfd: %8d events: %8x data: %16llx " " pos:%lli ino:%lx sdev:%x\n", epi->ffd.fd, epi->event.events, (long long)epi->event.data, (long long)epi->ffd.file->f_pos, inode->i_ino, inode->i_sb->s_dev); if (seq_has_overflowed(m)) break; } mutex_unlock(&ep->mtx); } #endif /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = ep_show_fdinfo, #endif .release = ep_eventpoll_release, .poll = ep_eventpoll_poll, .llseek = noop_llseek, .unlocked_ioctl = ep_eventpoll_ioctl, .compat_ioctl = compat_ptr_ioctl, }; /* * This is called from eventpoll_release() to unlink files from the eventpoll * interface. We need to have this facility to cleanup correctly files that are * closed without being removed from the eventpoll interface. */ void eventpoll_release_file(struct file *file) { struct eventpoll *ep; struct epitem *epi; bool dispose; /* * Use the 'dying' flag to prevent a concurrent ep_clear_and_put() from * touching the epitems list before eventpoll_release_file() can access * the ep->mtx. */ again: spin_lock(&file->f_lock); if (file->f_ep && file->f_ep->first) { epi = hlist_entry(file->f_ep->first, struct epitem, fllink); epi->dying = true; spin_unlock(&file->f_lock); /* * ep access is safe as we still own a reference to the ep * struct */ ep = epi->ep; mutex_lock(&ep->mtx); dispose = __ep_remove(ep, epi, true); mutex_unlock(&ep->mtx); if (dispose && ep_refcount_dec_and_test(ep)) ep_free(ep); goto again; } spin_unlock(&file->f_lock); } static int ep_alloc(struct eventpoll **pep) { struct eventpoll *ep; ep = kzalloc(sizeof(*ep), GFP_KERNEL); if (unlikely(!ep)) return -ENOMEM; mutex_init(&ep->mtx); rwlock_init(&ep->lock); init_waitqueue_head(&ep->wq); init_waitqueue_head(&ep->poll_wait); INIT_LIST_HEAD(&ep->rdllist); ep->rbr = RB_ROOT_CACHED; ep->ovflist = EP_UNACTIVE_PTR; ep->user = get_current_user(); refcount_set(&ep->refcount, 1); *pep = ep; return 0; } /* * Search the file inside the eventpoll tree. The RB tree operations * are protected by the "mtx" mutex, and ep_find() must be called with * "mtx" held. */ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd) { int kcmp; struct rb_node *rbp; struct epitem *epi, *epir = NULL; struct epoll_filefd ffd; ep_set_ffd(&ffd, file, fd); for (rbp = ep->rbr.rb_root.rb_node; rbp; ) { epi = rb_entry(rbp, struct epitem, rbn); kcmp = ep_cmp_ffd(&ffd, &epi->ffd); if (kcmp > 0) rbp = rbp->rb_right; else if (kcmp < 0) rbp = rbp->rb_left; else { epir = epi; break; } } return epir; } #ifdef CONFIG_KCMP static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long toff) { struct rb_node *rbp; struct epitem *epi; for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) { epi = rb_entry(rbp, struct epitem, rbn); if (epi->ffd.fd == tfd) { if (toff == 0) return epi; else toff--; } cond_resched(); } return NULL; } struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, unsigned long toff) { struct file *file_raw; struct eventpoll *ep; struct epitem *epi; if (!is_file_epoll(file)) return ERR_PTR(-EINVAL); ep = file->private_data; mutex_lock(&ep->mtx); epi = ep_find_tfd(ep, tfd, toff); if (epi) file_raw = epi->ffd.file; else file_raw = ERR_PTR(-ENOENT); mutex_unlock(&ep->mtx); return file_raw; } #endif /* CONFIG_KCMP */ /* * Adds a new entry to the tail of the list in a lockless way, i.e. * multiple CPUs are allowed to call this function concurrently. * * Beware: it is necessary to prevent any other modifications of the * existing list until all changes are completed, in other words * concurrent list_add_tail_lockless() calls should be protected * with a read lock, where write lock acts as a barrier which * makes sure all list_add_tail_lockless() calls are fully * completed. * * Also an element can be locklessly added to the list only in one * direction i.e. either to the tail or to the head, otherwise * concurrent access will corrupt the list. * * Return: %false if element has been already added to the list, %true * otherwise. */ static inline bool list_add_tail_lockless(struct list_head *new, struct list_head *head) { struct list_head *prev; /* * This is simple 'new->next = head' operation, but cmpxchg() * is used in order to detect that same element has been just * added to the list from another CPU: the winner observes * new->next == new. */ if (!try_cmpxchg(&new->next, &new, head)) return false; /* * Initially ->next of a new element must be updated with the head * (we are inserting to the tail) and only then pointers are atomically * exchanged. XCHG guarantees memory ordering, thus ->next should be * updated before pointers are actually swapped and pointers are * swapped before prev->next is updated. */ prev = xchg(&head->prev, new); /* * It is safe to modify prev->next and new->prev, because a new element * is added only to the tail and new->next is updated before XCHG. */ prev->next = new; new->prev = prev; return true; } /* * Chains a new epi entry to the tail of the ep->ovflist in a lockless way, * i.e. multiple CPUs are allowed to call this function concurrently. * * Return: %false if epi element has been already chained, %true otherwise. */ static inline bool chain_epi_lockless(struct epitem *epi) { struct eventpoll *ep = epi->ep; /* Fast preliminary check */ if (epi->next != EP_UNACTIVE_PTR) return false; /* Check that the same epi has not been just chained from another CPU */ if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR) return false; /* Atomically exchange tail */ epi->next = xchg(&ep->ovflist, epi); return true; } /* * This is the callback that is passed to the wait queue wakeup * mechanism. It is called by the stored file descriptors when they * have events to report. * * This callback takes a read lock in order not to contend with concurrent * events from another file descriptor, thus all modifications to ->rdllist * or ->ovflist are lockless. Read lock is paired with the write lock from * ep_start/done_scan(), which stops all list modifications and guarantees * that lists state is seen correctly. * * Another thing worth to mention is that ep_poll_callback() can be called * concurrently for the same @epi from different CPUs if poll table was inited * with several wait queues entries. Plural wakeup from different CPUs of a * single wait queue is serialized by wq.lock, but the case when multiple wait * queues are used should be detected accordingly. This is detected using * cmpxchg() operation. */ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) { int pwake = 0; struct epitem *epi = ep_item_from_wait(wait); struct eventpoll *ep = epi->ep; __poll_t pollflags = key_to_poll(key); unsigned long flags; int ewake = 0; read_lock_irqsave(&ep->lock, flags); ep_set_busy_poll_napi_id(epi); /* * If the event mask does not contain any poll(2) event, we consider the * descriptor to be disabled. This condition is likely the effect of the * EPOLLONESHOT bit that disables the descriptor when an event is received, * until the next EPOLL_CTL_MOD will be issued. */ if (!(epi->event.events & ~EP_PRIVATE_BITS)) goto out_unlock; /* * Check the events coming with the callback. At this stage, not * every device reports the events in the "key" parameter of the * callback. We need to be able to handle both cases here, hence the * test for "key" != NULL before the event match test. */ if (pollflags && !(pollflags & epi->event.events)) goto out_unlock; /* * If we are transferring events to userspace, we can hold no locks * (because we're accessing user memory, and because of linux f_op->poll() * semantics). All the events that happen during that period of time are * chained in ep->ovflist and requeued later on. */ if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) { if (chain_epi_lockless(epi)) ep_pm_stay_awake_rcu(epi); } else if (!ep_is_linked(epi)) { /* In the usual case, add event to ready list. */ if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) ep_pm_stay_awake_rcu(epi); } /* * Wake up ( if active ) both the eventpoll wait list and the ->poll() * wait list. */ if (waitqueue_active(&ep->wq)) { if ((epi->event.events & EPOLLEXCLUSIVE) && !(pollflags & POLLFREE)) { switch (pollflags & EPOLLINOUT_BITS) { case EPOLLIN: if (epi->event.events & EPOLLIN) ewake = 1; break; case EPOLLOUT: if (epi->event.events & EPOLLOUT) ewake = 1; break; case 0: ewake = 1; break; } } if (sync) wake_up_sync(&ep->wq); else wake_up(&ep->wq); } if (waitqueue_active(&ep->poll_wait)) pwake++; out_unlock: read_unlock_irqrestore(&ep->lock, flags); /* We have to call this outside the lock */ if (pwake) ep_poll_safewake(ep, epi, pollflags & EPOLL_URING_WAKE); if (!(epi->event.events & EPOLLEXCLUSIVE)) ewake = 1; if (pollflags & POLLFREE) { /* * If we race with ep_remove_wait_queue() it can miss * ->whead = NULL and do another remove_wait_queue() after * us, so we can't use __remove_wait_queue(). */ list_del_init(&wait->entry); /* * ->whead != NULL protects us from the race with * ep_clear_and_put() or ep_remove(), ep_remove_wait_queue() * takes whead->lock held by the caller. Once we nullify it, * nothing protects ep/epi or even wait. */ smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL); } return ewake; } /* * This is the callback that is used to add our wait queue to the * target file wakeup lists. */ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt) { struct ep_pqueue *epq = container_of(pt, struct ep_pqueue, pt); struct epitem *epi = epq->epi; struct eppoll_entry *pwq; if (unlikely(!epi)) // an earlier allocation has failed return; pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL); if (unlikely(!pwq)) { epq->epi = NULL; return; } init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); pwq->whead = whead; pwq->base = epi; if (epi->event.events & EPOLLEXCLUSIVE) add_wait_queue_exclusive(whead, &pwq->wait); else add_wait_queue(whead, &pwq->wait); pwq->next = epi->pwqlist; epi->pwqlist = pwq; } static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi) { int kcmp; struct rb_node **p = &ep->rbr.rb_root.rb_node, *parent = NULL; struct epitem *epic; bool leftmost = true; while (*p) { parent = *p; epic = rb_entry(parent, struct epitem, rbn); kcmp = ep_cmp_ffd(&epi->ffd, &epic->ffd); if (kcmp > 0) { p = &parent->rb_right; leftmost = false; } else p = &parent->rb_left; } rb_link_node(&epi->rbn, parent, p); rb_insert_color_cached(&epi->rbn, &ep->rbr, leftmost); } #define PATH_ARR_SIZE 5 /* * These are the number paths of length 1 to 5, that we are allowing to emanate * from a single file of interest. For example, we allow 1000 paths of length * 1, to emanate from each file of interest. This essentially represents the * potential wakeup paths, which need to be limited in order to avoid massive * uncontrolled wakeup storms. The common use case should be a single ep which * is connected to n file sources. In this case each file source has 1 path * of length 1. Thus, the numbers below should be more than sufficient. These * path limits are enforced during an EPOLL_CTL_ADD operation, since a modify * and delete can't add additional paths. Protected by the epnested_mutex. */ static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 }; static int path_count[PATH_ARR_SIZE]; static int path_count_inc(int nests) { /* Allow an arbitrary number of depth 1 paths */ if (nests == 0) return 0; if (++path_count[nests] > path_limits[nests]) return -1; return 0; } static void path_count_init(void) { int i; for (i = 0; i < PATH_ARR_SIZE; i++) path_count[i] = 0; } static int reverse_path_check_proc(struct hlist_head *refs, int depth) { int error = 0; struct epitem *epi; if (depth > EP_MAX_NESTS) /* too deep nesting */ return -1; /* CTL_DEL can remove links here, but that can't increase our count */ hlist_for_each_entry_rcu(epi, refs, fllink) { struct hlist_head *refs = &epi->ep->refs; if (hlist_empty(refs)) error = path_count_inc(depth); else error = reverse_path_check_proc(refs, depth + 1); if (error != 0) break; } return error; } /** * reverse_path_check - The tfile_check_list is list of epitem_head, which have * links that are proposed to be newly added. We need to * make sure that those added links don't add too many * paths such that we will spend all our time waking up * eventpoll objects. * * Return: %zero if the proposed links don't create too many paths, * %-1 otherwise. */ static int reverse_path_check(void) { struct epitems_head *p; for (p = tfile_check_list; p != EP_UNACTIVE_PTR; p = p->next) { int error; path_count_init(); rcu_read_lock(); error = reverse_path_check_proc(&p->epitems, 0); rcu_read_unlock(); if (error) return error; } return 0; } static int ep_create_wakeup_source(struct epitem *epi) { struct name_snapshot n; struct wakeup_source *ws; if (!epi->ep->ws) { epi->ep->ws = wakeup_source_register(NULL, "eventpoll"); if (!epi->ep->ws) return -ENOMEM; } take_dentry_name_snapshot(&n, epi->ffd.file->f_path.dentry); ws = wakeup_source_register(NULL, n.name.name); release_dentry_name_snapshot(&n); if (!ws) return -ENOMEM; rcu_assign_pointer(epi->ws, ws); return 0; } /* rare code path, only used when EPOLL_CTL_MOD removes a wakeup source */ static noinline void ep_destroy_wakeup_source(struct epitem *epi) { struct wakeup_source *ws = ep_wakeup_source(epi); RCU_INIT_POINTER(epi->ws, NULL); /* * wait for ep_pm_stay_awake_rcu to finish, synchronize_rcu is * used internally by wakeup_source_remove, too (called by * wakeup_source_unregister), so we cannot use call_rcu */ synchronize_rcu(); wakeup_source_unregister(ws); } static int attach_epitem(struct file *file, struct epitem *epi) { struct epitems_head *to_free = NULL; struct hlist_head *head = NULL; struct eventpoll *ep = NULL; if (is_file_epoll(file)) ep = file->private_data; if (ep) { head = &ep->refs; } else if (!READ_ONCE(file->f_ep)) { allocate: to_free = kmem_cache_zalloc(ephead_cache, GFP_KERNEL); if (!to_free) return -ENOMEM; head = &to_free->epitems; } spin_lock(&file->f_lock); if (!file->f_ep) { if (unlikely(!head)) { spin_unlock(&file->f_lock); goto allocate; } /* See eventpoll_release() for details. */ WRITE_ONCE(file->f_ep, head); to_free = NULL; } hlist_add_head_rcu(&epi->fllink, file->f_ep); spin_unlock(&file->f_lock); free_ephead(to_free); return 0; } /* * Must be called with "mtx" held. */ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, struct file *tfile, int fd, int full_check) { int error, pwake = 0; __poll_t revents; struct epitem *epi; struct ep_pqueue epq; struct eventpoll *tep = NULL; if (is_file_epoll(tfile)) tep = tfile->private_data; lockdep_assert_irqs_enabled(); if (unlikely(percpu_counter_compare(&ep->user->epoll_watches, max_user_watches) >= 0)) return -ENOSPC; percpu_counter_inc(&ep->user->epoll_watches); if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) { percpu_counter_dec(&ep->user->epoll_watches); return -ENOMEM; } /* Item initialization follow here ... */ INIT_LIST_HEAD(&epi->rdllink); epi->ep = ep; ep_set_ffd(&epi->ffd, tfile, fd); epi->event = *event; epi->next = EP_UNACTIVE_PTR; if (tep) mutex_lock_nested(&tep->mtx, 1); /* Add the current item to the list of active epoll hook for this file */ if (unlikely(attach_epitem(tfile, epi) < 0)) { if (tep) mutex_unlock(&tep->mtx); kmem_cache_free(epi_cache, epi); percpu_counter_dec(&ep->user->epoll_watches); return -ENOMEM; } if (full_check && !tep) list_file(tfile); /* * Add the current item to the RB tree. All RB tree operations are * protected by "mtx", and ep_insert() is called with "mtx" held. */ ep_rbtree_insert(ep, epi); if (tep) mutex_unlock(&tep->mtx); /* * ep_remove_safe() calls in the later error paths can't lead to * ep_free() as the ep file itself still holds an ep reference. */ ep_get(ep); /* now check if we've created too many backpaths */ if (unlikely(full_check && reverse_path_check())) { ep_remove_safe(ep, epi); return -EINVAL; } if (epi->event.events & EPOLLWAKEUP) { error = ep_create_wakeup_source(epi); if (error) { ep_remove_safe(ep, epi); return error; } } /* Initialize the poll table using the queue callback */ epq.epi = epi; init_poll_funcptr(&epq.pt, ep_ptable_queue_proc); /* * Attach the item to the poll hooks and get current event bits. * We can safely use the file* here because its usage count has * been increased by the caller of this function. Note that after * this operation completes, the poll callback can start hitting * the new item. */ revents = ep_item_poll(epi, &epq.pt, 1); /* * We have to check if something went wrong during the poll wait queue * install process. Namely an allocation for a wait queue failed due * high memory pressure. */ if (unlikely(!epq.epi)) { ep_remove_safe(ep, epi); return -ENOMEM; } /* We have to drop the new item inside our item list to keep track of it */ write_lock_irq(&ep->lock); /* record NAPI ID of new item if present */ ep_set_busy_poll_napi_id(epi); /* If the file is already "ready" we drop it inside the ready list */ if (revents && !ep_is_linked(epi)) { list_add_tail(&epi->rdllink, &ep->rdllist); ep_pm_stay_awake(epi); /* Notify waiting tasks that events are available */ if (waitqueue_active(&ep->wq)) wake_up(&ep->wq); if (waitqueue_active(&ep->poll_wait)) pwake++; } write_unlock_irq(&ep->lock); /* We have to call this outside the lock */ if (pwake) ep_poll_safewake(ep, NULL, 0); return 0; } /* * Modify the interest event mask by dropping an event if the new mask * has a match in the current file status. Must be called with "mtx" held. */ static int ep_modify(struct eventpoll *ep, struct epitem *epi, const struct epoll_event *event) { int pwake = 0; poll_table pt; lockdep_assert_irqs_enabled(); init_poll_funcptr(&pt, NULL); /* * Set the new event interest mask before calling f_op->poll(); * otherwise we might miss an event that happens between the * f_op->poll() call and the new event set registering. */ epi->event.events = event->events; /* need barrier below */ epi->event.data = event->data; /* protected by mtx */ if (epi->event.events & EPOLLWAKEUP) { if (!ep_has_wakeup_source(epi)) ep_create_wakeup_source(epi); } else if (ep_has_wakeup_source(epi)) { ep_destroy_wakeup_source(epi); } /* * The following barrier has two effects: * * 1) Flush epi changes above to other CPUs. This ensures * we do not miss events from ep_poll_callback if an * event occurs immediately after we call f_op->poll(). * We need this because we did not take ep->lock while * changing epi above (but ep_poll_callback does take * ep->lock). * * 2) We also need to ensure we do not miss _past_ events * when calling f_op->poll(). This barrier also * pairs with the barrier in wq_has_sleeper (see * comments for wq_has_sleeper). * * This barrier will now guarantee ep_poll_callback or f_op->poll * (or both) will notice the readiness of an item. */ smp_mb(); /* * Get current event bits. We can safely use the file* here because * its usage count has been increased by the caller of this function. * If the item is "hot" and it is not registered inside the ready * list, push it inside. */ if (ep_item_poll(epi, &pt, 1)) { write_lock_irq(&ep->lock); if (!ep_is_linked(epi)) { list_add_tail(&epi->rdllink, &ep->rdllist); ep_pm_stay_awake(epi); /* Notify waiting tasks that events are available */ if (waitqueue_active(&ep->wq)) wake_up(&ep->wq); if (waitqueue_active(&ep->poll_wait)) pwake++; } write_unlock_irq(&ep->lock); } /* We have to call this outside the lock */ if (pwake) ep_poll_safewake(ep, NULL, 0); return 0; } static int ep_send_events(struct eventpoll *ep, struct epoll_event __user *events, int maxevents) { struct epitem *epi, *tmp; LIST_HEAD(txlist); poll_table pt; int res = 0; /* * Always short-circuit for fatal signals to allow threads to make a * timely exit without the chance of finding more events available and * fetching repeatedly. */ if (fatal_signal_pending(current)) return -EINTR; init_poll_funcptr(&pt, NULL); mutex_lock(&ep->mtx); ep_start_scan(ep, &txlist); /* * We can loop without lock because we are passed a task private list. * Items cannot vanish during the loop we are holding ep->mtx. */ list_for_each_entry_safe(epi, tmp, &txlist, rdllink) { struct wakeup_source *ws; __poll_t revents; if (res >= maxevents) break; /* * Activate ep->ws before deactivating epi->ws to prevent * triggering auto-suspend here (in case we reactive epi->ws * below). * * This could be rearranged to delay the deactivation of epi->ws * instead, but then epi->ws would temporarily be out of sync * with ep_is_linked(). */ ws = ep_wakeup_source(epi); if (ws) { if (ws->active) __pm_stay_awake(ep->ws); __pm_relax(ws); } list_del_init(&epi->rdllink); /* * If the event mask intersect the caller-requested one, * deliver the event to userspace. Again, we are holding ep->mtx, * so no operations coming from userspace can change the item. */ revents = ep_item_poll(epi, &pt, 1); if (!revents) continue; events = epoll_put_uevent(revents, epi->event.data, events); if (!events) { list_add(&epi->rdllink, &txlist); ep_pm_stay_awake(epi); if (!res) res = -EFAULT; break; } res++; if (epi->event.events & EPOLLONESHOT) epi->event.events &= EP_PRIVATE_BITS; else if (!(epi->event.events & EPOLLET)) { /* * If this file has been added with Level * Trigger mode, we need to insert back inside * the ready list, so that the next call to * epoll_wait() will check again the events * availability. At this point, no one can insert * into ep->rdllist besides us. The epoll_ctl() * callers are locked out by * ep_send_events() holding "mtx" and the * poll callback will queue them in ep->ovflist. */ list_add_tail(&epi->rdllink, &ep->rdllist); ep_pm_stay_awake(epi); } } ep_done_scan(ep, &txlist); mutex_unlock(&ep->mtx); return res; } static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms) { struct timespec64 now; if (ms < 0) return NULL; if (!ms) { to->tv_sec = 0; to->tv_nsec = 0; return to; } to->tv_sec = ms / MSEC_PER_SEC; to->tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC); ktime_get_ts64(&now); *to = timespec64_add_safe(now, *to); return to; } /* * autoremove_wake_function, but remove even on failure to wake up, because we * know that default_wake_function/ttwu will only fail if the thread is already * woken, and in that case the ep_poll loop will remove the entry anyways, not * try to reuse it. */ static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned int mode, int sync, void *key) { int ret = default_wake_function(wq_entry, mode, sync, key); /* * Pairs with list_empty_careful in ep_poll, and ensures future loop * iterations see the cause of this wakeup. */ list_del_init_careful(&wq_entry->entry); return ret; } static int ep_try_send_events(struct eventpoll *ep, struct epoll_event __user *events, int maxevents) { int res; /* * Try to transfer events to user space. In case we get 0 events and * there's still timeout left over, we go trying again in search of * more luck. */ res = ep_send_events(ep, events, maxevents); if (res > 0) ep_suspend_napi_irqs(ep); return res; } static int ep_schedule_timeout(ktime_t *to) { if (to) return ktime_after(*to, ktime_get()); else return 1; } /** * ep_poll - Retrieves ready events, and delivers them to the caller-supplied * event buffer. * * @ep: Pointer to the eventpoll context. * @events: Pointer to the userspace buffer where the ready events should be * stored. * @maxevents: Size (in terms of number of events) of the caller event buffer. * @timeout: Maximum timeout for the ready events fetch operation, in * timespec. If the timeout is zero, the function will not block, * while if the @timeout ptr is NULL, the function will block * until at least one event has been retrieved (or an error * occurred). * * Return: the number of ready events which have been fetched, or an * error code, in case of error. */ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents, struct timespec64 *timeout) { int res, eavail, timed_out = 0; u64 slack = 0; wait_queue_entry_t wait; ktime_t expires, *to = NULL; lockdep_assert_irqs_enabled(); if (timeout && (timeout->tv_sec | timeout->tv_nsec)) { slack = select_estimate_accuracy(timeout); to = &expires; *to = timespec64_to_ktime(*timeout); } else if (timeout) { /* * Avoid the unnecessary trip to the wait queue loop, if the * caller specified a non blocking operation. */ timed_out = 1; } /* * This call is racy: We may or may not see events that are being added * to the ready list under the lock (e.g., in IRQ callbacks). For cases * with a non-zero timeout, this thread will check the ready list under * lock and will add to the wait queue. For cases with a zero * timeout, the user by definition should not care and will have to * recheck again. */ eavail = ep_events_available(ep); while (1) { if (eavail) { res = ep_try_send_events(ep, events, maxevents); if (res) return res; } if (timed_out) return 0; eavail = ep_busy_loop(ep); if (eavail) continue; if (signal_pending(current)) return -EINTR; /* * Internally init_wait() uses autoremove_wake_function(), * thus wait entry is removed from the wait queue on each * wakeup. Why it is important? In case of several waiters * each new wakeup will hit the next waiter, giving it the * chance to harvest new event. Otherwise wakeup can be * lost. This is also good performance-wise, because on * normal wakeup path no need to call __remove_wait_queue() * explicitly, thus ep->lock is not taken, which halts the * event delivery. * * In fact, we now use an even more aggressive function that * unconditionally removes, because we don't reuse the wait * entry between loop iterations. This lets us also avoid the * performance issue if a process is killed, causing all of its * threads to wake up without being removed normally. */ init_wait(&wait); wait.func = ep_autoremove_wake_function; write_lock_irq(&ep->lock); /* * Barrierless variant, waitqueue_active() is called under * the same lock on wakeup ep_poll_callback() side, so it * is safe to avoid an explicit barrier. */ __set_current_state(TASK_INTERRUPTIBLE); /* * Do the final check under the lock. ep_start/done_scan() * plays with two lists (->rdllist and ->ovflist) and there * is always a race when both lists are empty for short * period of time although events are pending, so lock is * important. */ eavail = ep_events_available(ep); if (!eavail) __add_wait_queue_exclusive(&ep->wq, &wait); write_unlock_irq(&ep->lock); if (!eavail) timed_out = !ep_schedule_timeout(to) || !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS); __set_current_state(TASK_RUNNING); /* * We were woken up, thus go and try to harvest some events. * If timed out and still on the wait queue, recheck eavail * carefully under lock, below. */ eavail = 1; if (!list_empty_careful(&wait.entry)) { write_lock_irq(&ep->lock); /* * If the thread timed out and is not on the wait queue, * it means that the thread was woken up after its * timeout expired before it could reacquire the lock. * Thus, when wait.entry is empty, it needs to harvest * events. */ if (timed_out) eavail = list_empty(&wait.entry); __remove_wait_queue(&ep->wq, &wait); write_unlock_irq(&ep->lock); } } } /** * ep_loop_check_proc - verify that adding an epoll file @ep inside another * epoll file does not create closed loops, and * determine the depth of the subtree starting at @ep * * @ep: the &struct eventpoll to be currently checked. * @depth: Current depth of the path being checked. * * Return: depth of the subtree, or INT_MAX if we found a loop or went too deep. */ static int ep_loop_check_proc(struct eventpoll *ep, int depth) { int result = 0; struct rb_node *rbp; struct epitem *epi; if (ep->gen == loop_check_gen) return ep->loop_check_depth; mutex_lock_nested(&ep->mtx, depth + 1); ep->gen = loop_check_gen; for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) { epi = rb_entry(rbp, struct epitem, rbn); if (unlikely(is_file_epoll(epi->ffd.file))) { struct eventpoll *ep_tovisit; ep_tovisit = epi->ffd.file->private_data; if (ep_tovisit == inserting_into || depth > EP_MAX_NESTS) result = INT_MAX; else result = max(result, ep_loop_check_proc(ep_tovisit, depth + 1) + 1); if (result > EP_MAX_NESTS) break; } else { /* * If we've reached a file that is not associated with * an ep, then we need to check if the newly added * links are going to add too many wakeup paths. We do * this by adding it to the tfile_check_list, if it's * not already there, and calling reverse_path_check() * during ep_insert(). */ list_file(epi->ffd.file); } } ep->loop_check_depth = result; mutex_unlock(&ep->mtx); return result; } /* ep_get_upwards_depth_proc - determine depth of @ep when traversed upwards */ static int ep_get_upwards_depth_proc(struct eventpoll *ep, int depth) { int result = 0; struct epitem *epi; if (ep->gen == loop_check_gen) return ep->loop_check_depth; hlist_for_each_entry_rcu(epi, &ep->refs, fllink) result = max(result, ep_get_upwards_depth_proc(epi->ep, depth + 1) + 1); ep->gen = loop_check_gen; ep->loop_check_depth = result; return result; } /** * ep_loop_check - Performs a check to verify that adding an epoll file (@to) * into another epoll file (represented by @ep) does not create * closed loops or too deep chains. * * @ep: Pointer to the epoll we are inserting into. * @to: Pointer to the epoll to be inserted. * * Return: %zero if adding the epoll @to inside the epoll @from * does not violate the constraints, or %-1 otherwise. */ static int ep_loop_check(struct eventpoll *ep, struct eventpoll *to) { int depth, upwards_depth; inserting_into = ep; /* * Check how deep down we can get from @to, and whether it is possible * to loop up to @ep. */ depth = ep_loop_check_proc(to, 0); if (depth > EP_MAX_NESTS) return -1; /* Check how far up we can go from @ep. */ rcu_read_lock(); upwards_depth = ep_get_upwards_depth_proc(ep, 0); rcu_read_unlock(); return (depth+1+upwards_depth > EP_MAX_NESTS) ? -1 : 0; } static void clear_tfile_check_list(void) { rcu_read_lock(); while (tfile_check_list != EP_UNACTIVE_PTR) { struct epitems_head *head = tfile_check_list; tfile_check_list = head->next; unlist_file(head); } rcu_read_unlock(); } /* * Open an eventpoll file descriptor. */ static int do_epoll_create(int flags) { int error, fd; struct eventpoll *ep = NULL; struct file *file; /* Check the EPOLL_* constant for consistency. */ BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC); if (flags & ~EPOLL_CLOEXEC) return -EINVAL; /* * Create the internal data structure ("struct eventpoll"). */ error = ep_alloc(&ep); if (error < 0) return error; /* * Creates all the items needed to setup an eventpoll file. That is, * a file structure and a free file descriptor. */ fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC)); if (fd < 0) { error = fd; goto out_free_ep; } file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep, O_RDWR | (flags & O_CLOEXEC)); if (IS_ERR(file)) { error = PTR_ERR(file); goto out_free_fd; } ep->file = file; fd_install(fd, file); return fd; out_free_fd: put_unused_fd(fd); out_free_ep: ep_clear_and_put(ep); return error; } SYSCALL_DEFINE1(epoll_create1, int, flags) { return do_epoll_create(flags); } SYSCALL_DEFINE1(epoll_create, int, size) { if (size <= 0) return -EINVAL; return do_epoll_create(0); } #ifdef CONFIG_PM_SLEEP static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) { if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND)) epev->events &= ~EPOLLWAKEUP; } #else static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) { epev->events &= ~EPOLLWAKEUP; } #endif static inline int epoll_mutex_lock(struct mutex *mutex, int depth, bool nonblock) { if (!nonblock) { mutex_lock_nested(mutex, depth); return 0; } if (mutex_trylock(mutex)) return 0; return -EAGAIN; } int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds, bool nonblock) { int error; int full_check = 0; struct eventpoll *ep; struct epitem *epi; struct eventpoll *tep = NULL; CLASS(fd, f)(epfd); if (fd_empty(f)) return -EBADF; /* Get the "struct file *" for the target file */ CLASS(fd, tf)(fd); if (fd_empty(tf)) return -EBADF; /* The target file descriptor must support poll */ if (!file_can_poll(fd_file(tf))) return -EPERM; /* Check if EPOLLWAKEUP is allowed */ if (ep_op_has_event(op)) ep_take_care_of_epollwakeup(epds); /* * We have to check that the file structure underneath the file descriptor * the user passed to us _is_ an eventpoll file. And also we do not permit * adding an epoll file descriptor inside itself. */ error = -EINVAL; if (fd_file(f) == fd_file(tf) || !is_file_epoll(fd_file(f))) goto error_tgt_fput; /* * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only, * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation. * Also, we do not currently supported nested exclusive wakeups. */ if (ep_op_has_event(op) && (epds->events & EPOLLEXCLUSIVE)) { if (op == EPOLL_CTL_MOD) goto error_tgt_fput; if (op == EPOLL_CTL_ADD && (is_file_epoll(fd_file(tf)) || (epds->events & ~EPOLLEXCLUSIVE_OK_BITS))) goto error_tgt_fput; } /* * At this point it is safe to assume that the "private_data" contains * our own data structure. */ ep = fd_file(f)->private_data; /* * When we insert an epoll file descriptor inside another epoll file * descriptor, there is the chance of creating closed loops, which are * better be handled here, than in more critical paths. While we are * checking for loops we also determine the list of files reachable * and hang them on the tfile_check_list, so we can check that we * haven't created too many possible wakeup paths. * * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when * the epoll file descriptor is attaching directly to a wakeup source, * unless the epoll file descriptor is nested. The purpose of taking the * 'epnested_mutex' on add is to prevent complex toplogies such as loops and * deep wakeup paths from forming in parallel through multiple * EPOLL_CTL_ADD operations. */ error = epoll_mutex_lock(&ep->mtx, 0, nonblock); if (error) goto error_tgt_fput; if (op == EPOLL_CTL_ADD) { if (READ_ONCE(fd_file(f)->f_ep) || ep->gen == loop_check_gen || is_file_epoll(fd_file(tf))) { mutex_unlock(&ep->mtx); error = epoll_mutex_lock(&epnested_mutex, 0, nonblock); if (error) goto error_tgt_fput; loop_check_gen++; full_check = 1; if (is_file_epoll(fd_file(tf))) { tep = fd_file(tf)->private_data; error = -ELOOP; if (ep_loop_check(ep, tep) != 0) goto error_tgt_fput; } error = epoll_mutex_lock(&ep->mtx, 0, nonblock); if (error) goto error_tgt_fput; } } /* * Try to lookup the file inside our RB tree. Since we grabbed "mtx" * above, we can be sure to be able to use the item looked up by * ep_find() till we release the mutex. */ epi = ep_find(ep, fd_file(tf), fd); error = -EINVAL; switch (op) { case EPOLL_CTL_ADD: if (!epi) { epds->events |= EPOLLERR | EPOLLHUP; error = ep_insert(ep, epds, fd_file(tf), fd, full_check); } else error = -EEXIST; break; case EPOLL_CTL_DEL: if (epi) { /* * The eventpoll itself is still alive: the refcount * can't go to zero here. */ ep_remove_safe(ep, epi); error = 0; } else { error = -ENOENT; } break; case EPOLL_CTL_MOD: if (epi) { if (!(epi->event.events & EPOLLEXCLUSIVE)) { epds->events |= EPOLLERR | EPOLLHUP; error = ep_modify(ep, epi, epds); } } else error = -ENOENT; break; } mutex_unlock(&ep->mtx); error_tgt_fput: if (full_check) { clear_tfile_check_list(); loop_check_gen++; mutex_unlock(&epnested_mutex); } return error; } /* * The following function implements the controller interface for * the eventpoll file that enables the insertion/removal/change of * file descriptors inside the interest set. */ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event) { struct epoll_event epds; if (ep_op_has_event(op) && copy_from_user(&epds, event, sizeof(struct epoll_event))) return -EFAULT; return do_epoll_ctl(epfd, op, fd, &epds, false); } static int ep_check_params(struct file *file, struct epoll_event __user *evs, int maxevents) { /* The maximum number of event must be greater than zero */ if (maxevents <= 0 || maxevents > EP_MAX_EVENTS) return -EINVAL; /* Verify that the area passed by the user is writeable */ if (!access_ok(evs, maxevents * sizeof(struct epoll_event))) return -EFAULT; /* * We have to check that the file structure underneath the fd * the user passed to us _is_ an eventpoll file. */ if (!is_file_epoll(file)) return -EINVAL; return 0; } int epoll_sendevents(struct file *file, struct epoll_event __user *events, int maxevents) { struct eventpoll *ep; int ret; ret = ep_check_params(file, events, maxevents); if (unlikely(ret)) return ret; ep = file->private_data; /* * Racy call, but that's ok - it should get retried based on * poll readiness anyway. */ if (ep_events_available(ep)) return ep_try_send_events(ep, events, maxevents); return 0; } /* * Implement the event wait interface for the eventpoll file. It is the kernel * part of the user space epoll_wait(2). */ static int do_epoll_wait(int epfd, struct epoll_event __user *events, int maxevents, struct timespec64 *to) { struct eventpoll *ep; int ret; /* Get the "struct file *" for the eventpoll file */ CLASS(fd, f)(epfd); if (fd_empty(f)) return -EBADF; ret = ep_check_params(fd_file(f), events, maxevents); if (unlikely(ret)) return ret; /* * At this point it is safe to assume that the "private_data" contains * our own data structure. */ ep = fd_file(f)->private_data; /* Time to fish for events ... */ return ep_poll(ep, events, maxevents, to); } SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout) { struct timespec64 to; return do_epoll_wait(epfd, events, maxevents, ep_timeout_to_timespec(&to, timeout)); } /* * Implement the event wait interface for the eventpoll file. It is the kernel * part of the user space epoll_pwait(2). */ static int do_epoll_pwait(int epfd, struct epoll_event __user *events, int maxevents, struct timespec64 *to, const sigset_t __user *sigmask, size_t sigsetsize) { int error; /* * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ error = set_user_sigmask(sigmask, sigsetsize); if (error) return error; error = do_epoll_wait(epfd, events, maxevents, to); restore_saved_sigmask_unless(error == -EINTR); return error; } SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout, const sigset_t __user *, sigmask, size_t, sigsetsize) { struct timespec64 to; return do_epoll_pwait(epfd, events, maxevents, ep_timeout_to_timespec(&to, timeout), sigmask, sigsetsize); } SYSCALL_DEFINE6(epoll_pwait2, int, epfd, struct epoll_event __user *, events, int, maxevents, const struct __kernel_timespec __user *, timeout, const sigset_t __user *, sigmask, size_t, sigsetsize) { struct timespec64 ts, *to = NULL; if (timeout) { if (get_timespec64(&ts, timeout)) return -EFAULT; to = &ts; if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec)) return -EINVAL; } return do_epoll_pwait(epfd, events, maxevents, to, sigmask, sigsetsize); } #ifdef CONFIG_COMPAT static int do_compat_epoll_pwait(int epfd, struct epoll_event __user *events, int maxevents, struct timespec64 *timeout, const compat_sigset_t __user *sigmask, compat_size_t sigsetsize) { long err; /* * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ err = set_compat_user_sigmask(sigmask, sigsetsize); if (err) return err; err = do_epoll_wait(epfd, events, maxevents, timeout); restore_saved_sigmask_unless(err == -EINTR); return err; } COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout, const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize) { struct timespec64 to; return do_compat_epoll_pwait(epfd, events, maxevents, ep_timeout_to_timespec(&to, timeout), sigmask, sigsetsize); } COMPAT_SYSCALL_DEFINE6(epoll_pwait2, int, epfd, struct epoll_event __user *, events, int, maxevents, const struct __kernel_timespec __user *, timeout, const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize) { struct timespec64 ts, *to = NULL; if (timeout) { if (get_timespec64(&ts, timeout)) return -EFAULT; to = &ts; if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec)) return -EINVAL; } return do_compat_epoll_pwait(epfd, events, maxevents, to, sigmask, sigsetsize); } #endif static int __init eventpoll_init(void) { struct sysinfo si; si_meminfo(&si); /* * Allows top 4% of lomem to be allocated for epoll watches (per user). */ max_user_watches = (((si.totalram - si.totalhigh) / 25) << PAGE_SHIFT) / EP_ITEM_COST; BUG_ON(max_user_watches < 0); /* * We can have many thousands of epitems, so prevent this from * using an extra cache line on 64-bit (and smaller) CPUs */ BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128); /* Allocates slab cache used to allocate "struct epitem" items */ epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); /* Allocates slab cache used to allocate "struct eppoll_entry" */ pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL); epoll_sysctls_init(); ephead_cache = kmem_cache_create("ep_head", sizeof(struct epitems_head), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL); return 0; } fs_initcall(eventpoll_init); |
| 757 183 109 1049 948 948 952 947 106 106 106 106 26 102 94 106 106 106 80 106 106 104 106 106 1216 28 1197 1197 34 1200 1203 38 5 1159 31 104 1061 339 277 550 607 1146 14 861 215 1069 33 76 1063 1060 149 946 117 117 117 183 183 183 66 117 117 66 117 949 956 953 219 749 732 749 751 293 457 1 870 79 78 938 115 812 9 9 17 2 951 946 951 955 952 530 529 528 528 527 1 1 1 530 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 | // SPDX-License-Identifier: GPL-2.0-or-later /* * IPv6 input * Linux INET6 implementation * * Authors: * Pedro Roque <roque@di.fc.ul.pt> * Ian P. Morris <I.P.Morris@soton.ac.uk> * * Based in linux/net/ipv4/ip_input.c */ /* Changes * * Mitsuru KANDA @USAGI and * YOSHIFUJI Hideaki @USAGI: Remove ipv6_parse_exthdrs(). */ #include <linux/errno.h> #include <linux/types.h> #include <linux/socket.h> #include <linux/sockios.h> #include <linux/net.h> #include <linux/netdevice.h> #include <linux/in6.h> #include <linux/icmpv6.h> #include <linux/mroute6.h> #include <linux/slab.h> #include <linux/indirect_call_wrapper.h> #include <linux/netfilter.h> #include <linux/netfilter_ipv6.h> #include <net/sock.h> #include <net/snmp.h> #include <net/udp.h> #include <net/ipv6.h> #include <net/protocol.h> #include <net/transp_v6.h> #include <net/rawv6.h> #include <net/ndisc.h> #include <net/ip6_route.h> #include <net/addrconf.h> #include <net/xfrm.h> #include <net/inet_ecn.h> #include <net/dst_metadata.h> static void ip6_rcv_finish_core(struct net *net, struct sock *sk, struct sk_buff *skb) { if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) && !skb_dst(skb) && !skb->sk) { switch (ipv6_hdr(skb)->nexthdr) { case IPPROTO_TCP: if (READ_ONCE(net->ipv4.sysctl_tcp_early_demux)) tcp_v6_early_demux(skb); break; case IPPROTO_UDP: if (READ_ONCE(net->ipv4.sysctl_udp_early_demux)) udp_v6_early_demux(skb); break; } } if (!skb_valid_dst(skb)) ip6_route_input(skb); } int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { /* if ingress device is enslaved to an L3 master device pass the * skb to its handler for processing */ skb = l3mdev_ip6_rcv(skb); if (!skb) return NET_RX_SUCCESS; ip6_rcv_finish_core(net, sk, skb); return dst_input(skb); } static void ip6_sublist_rcv_finish(struct list_head *head) { struct sk_buff *skb, *next; list_for_each_entry_safe(skb, next, head, list) { skb_list_del_init(skb); dst_input(skb); } } static bool ip6_can_use_hint(const struct sk_buff *skb, const struct sk_buff *hint) { return hint && !skb_dst(skb) && ipv6_addr_equal(&ipv6_hdr(hint)->daddr, &ipv6_hdr(skb)->daddr); } static struct sk_buff *ip6_extract_route_hint(const struct net *net, struct sk_buff *skb) { if (fib6_routes_require_src(net) || fib6_has_custom_rules(net) || IP6CB(skb)->flags & IP6SKB_MULTIPATH) return NULL; return skb; } static void ip6_list_rcv_finish(struct net *net, struct sock *sk, struct list_head *head) { struct sk_buff *skb, *next, *hint = NULL; struct dst_entry *curr_dst = NULL; LIST_HEAD(sublist); list_for_each_entry_safe(skb, next, head, list) { struct dst_entry *dst; skb_list_del_init(skb); /* if ingress device is enslaved to an L3 master device pass the * skb to its handler for processing */ skb = l3mdev_ip6_rcv(skb); if (!skb) continue; if (ip6_can_use_hint(skb, hint)) skb_dst_copy(skb, hint); else ip6_rcv_finish_core(net, sk, skb); dst = skb_dst(skb); if (curr_dst != dst) { hint = ip6_extract_route_hint(net, skb); /* dispatch old sublist */ if (!list_empty(&sublist)) ip6_sublist_rcv_finish(&sublist); /* start new sublist */ INIT_LIST_HEAD(&sublist); curr_dst = dst; } list_add_tail(&skb->list, &sublist); } /* dispatch final sublist */ ip6_sublist_rcv_finish(&sublist); } static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev, struct net *net) { enum skb_drop_reason reason; const struct ipv6hdr *hdr; u32 pkt_len; struct inet6_dev *idev; if (skb->pkt_type == PACKET_OTHERHOST) { dev_core_stats_rx_otherhost_dropped_inc(skb->dev); kfree_skb_reason(skb, SKB_DROP_REASON_OTHERHOST); return NULL; } rcu_read_lock(); idev = __in6_dev_get(skb->dev); __IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_IN, skb->len); SKB_DR_SET(reason, NOT_SPECIFIED); if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL || !idev || unlikely(READ_ONCE(idev->cnf.disable_ipv6))) { __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS); if (idev && unlikely(READ_ONCE(idev->cnf.disable_ipv6))) SKB_DR_SET(reason, IPV6DISABLED); goto drop; } memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm)); /* * Store incoming device index. When the packet will * be queued, we cannot refer to skb->dev anymore. * * BTW, when we send a packet for our own local address on a * non-loopback interface (e.g. ethX), it is being delivered * via the loopback interface (lo) here; skb->dev = loopback_dev. * It, however, should be considered as if it is being * arrived via the sending interface (ethX), because of the * nature of scoping architecture. --yoshfuji */ IP6CB(skb)->iif = skb_valid_dst(skb) ? ip6_dst_idev(skb_dst(skb))->dev->ifindex : dev->ifindex; if (unlikely(!pskb_may_pull(skb, sizeof(*hdr)))) goto err; hdr = ipv6_hdr(skb); if (hdr->version != 6) { SKB_DR_SET(reason, UNHANDLED_PROTO); goto err; } __IP6_ADD_STATS(net, idev, IPSTATS_MIB_NOECTPKTS + (ipv6_get_dsfield(hdr) & INET_ECN_MASK), max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs)); /* * RFC4291 2.5.3 * The loopback address must not be used as the source address in IPv6 * packets that are sent outside of a single node. [..] * A packet received on an interface with a destination address * of loopback must be dropped. */ if ((ipv6_addr_loopback(&hdr->saddr) || ipv6_addr_loopback(&hdr->daddr)) && !(dev->flags & IFF_LOOPBACK) && !netif_is_l3_master(dev)) goto err; /* RFC4291 Errata ID: 3480 * Interface-Local scope spans only a single interface on a * node and is useful only for loopback transmission of * multicast. Packets with interface-local scope received * from another node must be discarded. */ if (!(skb->pkt_type == PACKET_LOOPBACK || dev->flags & IFF_LOOPBACK) && ipv6_addr_is_multicast(&hdr->daddr) && IPV6_ADDR_MC_SCOPE(&hdr->daddr) == 1) goto err; /* If enabled, drop unicast packets that were encapsulated in link-layer * multicast or broadcast to protected against the so-called "hole-196" * attack in 802.11 wireless. */ if (!ipv6_addr_is_multicast(&hdr->daddr) && (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) && READ_ONCE(idev->cnf.drop_unicast_in_l2_multicast)) { SKB_DR_SET(reason, UNICAST_IN_L2_MULTICAST); goto err; } /* RFC4291 2.7 * Nodes must not originate a packet to a multicast address whose scope * field contains the reserved value 0; if such a packet is received, it * must be silently dropped. */ if (ipv6_addr_is_multicast(&hdr->daddr) && IPV6_ADDR_MC_SCOPE(&hdr->daddr) == 0) goto err; /* * RFC4291 2.7 * Multicast addresses must not be used as source addresses in IPv6 * packets or appear in any Routing header. */ if (ipv6_addr_is_multicast(&hdr->saddr)) goto err; skb->transport_header = skb->network_header + sizeof(*hdr); IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr); pkt_len = ntohs(hdr->payload_len); /* pkt_len may be zero if Jumbo payload option is present */ if (pkt_len || hdr->nexthdr != NEXTHDR_HOP) { if (pkt_len + sizeof(struct ipv6hdr) > skb->len) { __IP6_INC_STATS(net, idev, IPSTATS_MIB_INTRUNCATEDPKTS); SKB_DR_SET(reason, PKT_TOO_SMALL); goto drop; } if (pskb_trim_rcsum(skb, pkt_len + sizeof(struct ipv6hdr))) goto err; hdr = ipv6_hdr(skb); } if (hdr->nexthdr == NEXTHDR_HOP) { if (ipv6_parse_hopopts(skb) < 0) { __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); rcu_read_unlock(); return NULL; } } rcu_read_unlock(); /* Must drop socket now because of tproxy. */ if (!skb_sk_is_prefetched(skb)) skb_orphan(skb); return skb; err: __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); SKB_DR_OR(reason, IP_INHDR); drop: rcu_read_unlock(); kfree_skb_reason(skb, reason); return NULL; } int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) { struct net *net = dev_net(skb->dev); skb = ip6_rcv_core(skb, dev, net); if (skb == NULL) return NET_RX_DROP; return NF_HOOK(NFPROTO_IPV6, NF_INET_PRE_ROUTING, net, NULL, skb, dev, NULL, ip6_rcv_finish); } static void ip6_sublist_rcv(struct list_head *head, struct net_device *dev, struct net *net) { NF_HOOK_LIST(NFPROTO_IPV6, NF_INET_PRE_ROUTING, net, NULL, head, dev, NULL, ip6_rcv_finish); ip6_list_rcv_finish(net, NULL, head); } /* Receive a list of IPv6 packets */ void ipv6_list_rcv(struct list_head *head, struct packet_type *pt, struct net_device *orig_dev) { struct net_device *curr_dev = NULL; struct net *curr_net = NULL; struct sk_buff *skb, *next; LIST_HEAD(sublist); list_for_each_entry_safe(skb, next, head, list) { struct net_device *dev = skb->dev; struct net *net = dev_net(dev); skb_list_del_init(skb); skb = ip6_rcv_core(skb, dev, net); if (skb == NULL) continue; if (curr_dev != dev || curr_net != net) { /* dispatch old sublist */ if (!list_empty(&sublist)) ip6_sublist_rcv(&sublist, curr_dev, curr_net); /* start new sublist */ INIT_LIST_HEAD(&sublist); curr_dev = dev; curr_net = net; } list_add_tail(&skb->list, &sublist); } /* dispatch final sublist */ if (!list_empty(&sublist)) ip6_sublist_rcv(&sublist, curr_dev, curr_net); } INDIRECT_CALLABLE_DECLARE(int tcp_v6_rcv(struct sk_buff *)); /* * Deliver the packet to the host */ void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr, bool have_final) { const struct inet6_protocol *ipprot; struct inet6_dev *idev; unsigned int nhoff; SKB_DR(reason); bool raw; /* * Parse extension headers */ resubmit: idev = ip6_dst_idev(skb_dst(skb)); nhoff = IP6CB(skb)->nhoff; if (!have_final) { if (!pskb_pull(skb, skb_transport_offset(skb))) goto discard; nexthdr = skb_network_header(skb)[nhoff]; } resubmit_final: raw = raw6_local_deliver(skb, nexthdr); ipprot = rcu_dereference(inet6_protos[nexthdr]); if (ipprot) { int ret; if (have_final) { if (!(ipprot->flags & INET6_PROTO_FINAL)) { /* Once we've seen a final protocol don't * allow encapsulation on any non-final * ones. This allows foo in UDP encapsulation * to work. */ goto discard; } } else if (ipprot->flags & INET6_PROTO_FINAL) { const struct ipv6hdr *hdr; int sdif = inet6_sdif(skb); struct net_device *dev; /* Only do this once for first final protocol */ have_final = true; skb_postpull_rcsum(skb, skb_network_header(skb), skb_network_header_len(skb)); hdr = ipv6_hdr(skb); /* skb->dev passed may be master dev for vrfs. */ if (sdif) { dev = dev_get_by_index_rcu(net, sdif); if (!dev) goto discard; } else { dev = skb->dev; } if (ipv6_addr_is_multicast(&hdr->daddr) && !ipv6_chk_mcast_addr(dev, &hdr->daddr, &hdr->saddr) && !ipv6_is_mld(skb, nexthdr, skb_network_header_len(skb))) { SKB_DR_SET(reason, IP_INADDRERRORS); goto discard; } } if (!(ipprot->flags & INET6_PROTO_NOPOLICY)) { if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) { SKB_DR_SET(reason, XFRM_POLICY); goto discard; } nf_reset_ct(skb); } ret = INDIRECT_CALL_2(ipprot->handler, tcp_v6_rcv, udpv6_rcv, skb); if (ret > 0) { if (ipprot->flags & INET6_PROTO_FINAL) { /* Not an extension header, most likely UDP * encapsulation. Use return value as nexthdr * protocol not nhoff (which presumably is * not set by handler). */ nexthdr = ret; goto resubmit_final; } else { goto resubmit; } } else if (ret == 0) { __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDELIVERS); } } else { if (!raw) { if (xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) { __IP6_INC_STATS(net, idev, IPSTATS_MIB_INUNKNOWNPROTOS); icmpv6_send(skb, ICMPV6_PARAMPROB, ICMPV6_UNK_NEXTHDR, nhoff); SKB_DR_SET(reason, IP_NOPROTO); } else { SKB_DR_SET(reason, XFRM_POLICY); } kfree_skb_reason(skb, reason); } else { __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDELIVERS); consume_skb(skb); } } return; discard: __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS); kfree_skb_reason(skb, reason); } static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) { __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_INDISCARDS); kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM); return 0; } skb_clear_delivery_time(skb); ip6_protocol_deliver_rcu(net, skb, 0, false); return 0; } int ip6_input(struct sk_buff *skb) { int res; rcu_read_lock(); res = NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN, dev_net_rcu(skb->dev), NULL, skb, skb->dev, NULL, ip6_input_finish); rcu_read_unlock(); return res; } EXPORT_SYMBOL_GPL(ip6_input); int ip6_mc_input(struct sk_buff *skb) { struct net_device *dev = skb->dev; int sdif = inet6_sdif(skb); const struct ipv6hdr *hdr; bool deliver; __IP6_UPD_PO_STATS(skb_dst_dev_net_rcu(skb), __in6_dev_get_safely(dev), IPSTATS_MIB_INMCAST, skb->len); /* skb->dev passed may be master dev for vrfs. */ if (sdif) { dev = dev_get_by_index_rcu(dev_net_rcu(dev), sdif); if (!dev) { kfree_skb(skb); return -ENODEV; } } hdr = ipv6_hdr(skb); deliver = ipv6_chk_mcast_addr(dev, &hdr->daddr, NULL); #ifdef CONFIG_IPV6_MROUTE /* * IPv6 multicast router mode is now supported ;) */ if (atomic_read(&dev_net_rcu(skb->dev)->ipv6.devconf_all->mc_forwarding) && !(ipv6_addr_type(&hdr->daddr) & (IPV6_ADDR_LOOPBACK|IPV6_ADDR_LINKLOCAL)) && likely(!(IP6CB(skb)->flags & IP6SKB_FORWARDED))) { /* * Okay, we try to forward - split and duplicate * packets. */ struct sk_buff *skb2; struct inet6_skb_parm *opt = IP6CB(skb); /* Check for MLD */ if (unlikely(opt->flags & IP6SKB_ROUTERALERT)) { /* Check if this is a mld message */ u8 nexthdr = hdr->nexthdr; __be16 frag_off; int offset; /* Check if the value of Router Alert * is for MLD (0x0000). */ if (opt->ra == htons(IPV6_OPT_ROUTERALERT_MLD)) { deliver = false; if (!ipv6_ext_hdr(nexthdr)) { /* BUG */ goto out; } offset = ipv6_skip_exthdr(skb, sizeof(*hdr), &nexthdr, &frag_off); if (offset < 0) goto out; if (ipv6_is_mld(skb, nexthdr, offset)) deliver = true; goto out; } /* unknown RA - process it normally */ } if (deliver) { skb2 = skb_clone(skb, GFP_ATOMIC); } else { skb2 = skb; skb = NULL; } if (skb2) ip6_mr_input(skb2); } out: #endif if (likely(deliver)) { ip6_input(skb); } else { /* discard */ kfree_skb(skb); } return 0; } |
| 7 8 7 605 606 33 605 607 608 15 15 628 628 34 606 607 606 605 629 630 2 3 629 629 4 5 629 3 627 607 608 7 608 5 6 6 1 6 10 10 10 7 7 1 1 82 82 82 82 82 82 82 82 82 82 82 82 82 82 82 82 82 82 82 1 7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 | // SPDX-License-Identifier: GPL-2.0 /* * Block device elevator/IO-scheduler. * * Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE * * 30042000 Jens Axboe <axboe@kernel.dk> : * * Split the elevator a bit so that it is possible to choose a different * one or even write a new "plug in". There are three pieces: * - elevator_fn, inserts a new request in the queue list * - elevator_merge_fn, decides whether a new buffer can be merged with * an existing request * - elevator_dequeue_fn, called when a request is taken off the active list * * 20082000 Dave Jones <davej@suse.de> : * Removed tests for max-bomb-segments, which was breaking elvtune * when run without -bN * * Jens: * - Rework again to work with bio instead of buffer_heads * - loose bi_dev comparisons, partition handling is right now * - completely modularize elevator setup and teardown * */ #include <linux/kernel.h> #include <linux/fs.h> #include <linux/blkdev.h> #include <linux/bio.h> #include <linux/module.h> #include <linux/slab.h> #include <linux/init.h> #include <linux/compiler.h> #include <linux/blktrace_api.h> #include <linux/hash.h> #include <linux/uaccess.h> #include <linux/pm_runtime.h> #include <trace/events/block.h> #include "elevator.h" #include "blk.h" #include "blk-mq-sched.h" #include "blk-pm.h" #include "blk-wbt.h" #include "blk-cgroup.h" /* Holding context data for changing elevator */ struct elv_change_ctx { const char *name; bool no_uevent; /* for unregistering old elevator */ struct elevator_queue *old; /* for registering new elevator */ struct elevator_queue *new; /* holds sched tags data */ struct elevator_tags *et; }; static DEFINE_SPINLOCK(elv_list_lock); static LIST_HEAD(elv_list); /* * Merge hash stuff. */ #define rq_hash_key(rq) (blk_rq_pos(rq) + blk_rq_sectors(rq)) /* * Query io scheduler to see if the current process issuing bio may be * merged with rq. */ static bool elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio) { struct request_queue *q = rq->q; struct elevator_queue *e = q->elevator; if (e->type->ops.allow_merge) return e->type->ops.allow_merge(q, rq, bio); return true; } /* * can we safely merge with this request? */ bool elv_bio_merge_ok(struct request *rq, struct bio *bio) { if (!blk_rq_merge_ok(rq, bio)) return false; if (!elv_iosched_allow_bio_merge(rq, bio)) return false; return true; } EXPORT_SYMBOL(elv_bio_merge_ok); /** * elevator_match - Check whether @e's name or alias matches @name * @e: Scheduler to test * @name: Elevator name to test * * Return true if the elevator @e's name or alias matches @name. */ static bool elevator_match(const struct elevator_type *e, const char *name) { return !strcmp(e->elevator_name, name) || (e->elevator_alias && !strcmp(e->elevator_alias, name)); } static struct elevator_type *__elevator_find(const char *name) { struct elevator_type *e; list_for_each_entry(e, &elv_list, list) if (elevator_match(e, name)) return e; return NULL; } static struct elevator_type *elevator_find_get(const char *name) { struct elevator_type *e; spin_lock(&elv_list_lock); e = __elevator_find(name); if (e && (!elevator_tryget(e))) e = NULL; spin_unlock(&elv_list_lock); return e; } static const struct kobj_type elv_ktype; struct elevator_queue *elevator_alloc(struct request_queue *q, struct elevator_type *e, struct elevator_tags *et) { struct elevator_queue *eq; eq = kzalloc_node(sizeof(*eq), GFP_KERNEL, q->node); if (unlikely(!eq)) return NULL; __elevator_get(e); eq->type = e; kobject_init(&eq->kobj, &elv_ktype); mutex_init(&eq->sysfs_lock); hash_init(eq->hash); eq->et = et; return eq; } static void elevator_release(struct kobject *kobj) { struct elevator_queue *e; e = container_of(kobj, struct elevator_queue, kobj); elevator_put(e->type); kfree(e); } static void elevator_exit(struct request_queue *q) { struct elevator_queue *e = q->elevator; lockdep_assert_held(&q->elevator_lock); ioc_clear_queue(q); mutex_lock(&e->sysfs_lock); blk_mq_exit_sched(q, e); mutex_unlock(&e->sysfs_lock); } static inline void __elv_rqhash_del(struct request *rq) { hash_del(&rq->hash); rq->rq_flags &= ~RQF_HASHED; } void elv_rqhash_del(struct request_queue *q, struct request *rq) { if (ELV_ON_HASH(rq)) __elv_rqhash_del(rq); } EXPORT_SYMBOL_GPL(elv_rqhash_del); void elv_rqhash_add(struct request_queue *q, struct request *rq) { struct elevator_queue *e = q->elevator; BUG_ON(ELV_ON_HASH(rq)); hash_add(e->hash, &rq->hash, rq_hash_key(rq)); rq->rq_flags |= RQF_HASHED; } EXPORT_SYMBOL_GPL(elv_rqhash_add); void elv_rqhash_reposition(struct request_queue *q, struct request *rq) { __elv_rqhash_del(rq); elv_rqhash_add(q, rq); } struct request *elv_rqhash_find(struct request_queue *q, sector_t offset) { struct elevator_queue *e = q->elevator; struct hlist_node *next; struct request *rq; hash_for_each_possible_safe(e->hash, rq, next, hash, offset) { BUG_ON(!ELV_ON_HASH(rq)); if (unlikely(!rq_mergeable(rq))) { __elv_rqhash_del(rq); continue; } if (rq_hash_key(rq) == offset) return rq; } return NULL; } /* * RB-tree support functions for inserting/lookup/removal of requests * in a sorted RB tree. */ void elv_rb_add(struct rb_root *root, struct request *rq) { struct rb_node **p = &root->rb_node; struct rb_node *parent = NULL; struct request *__rq; while (*p) { parent = *p; __rq = rb_entry(parent, struct request, rb_node); if (blk_rq_pos(rq) < blk_rq_pos(__rq)) p = &(*p)->rb_left; else if (blk_rq_pos(rq) >= blk_rq_pos(__rq)) p = &(*p)->rb_right; } rb_link_node(&rq->rb_node, parent, p); rb_insert_color(&rq->rb_node, root); } EXPORT_SYMBOL(elv_rb_add); void elv_rb_del(struct rb_root *root, struct request *rq) { BUG_ON(RB_EMPTY_NODE(&rq->rb_node)); rb_erase(&rq->rb_node, root); RB_CLEAR_NODE(&rq->rb_node); } EXPORT_SYMBOL(elv_rb_del); struct request *elv_rb_find(struct rb_root *root, sector_t sector) { struct rb_node *n = root->rb_node; struct request *rq; while (n) { rq = rb_entry(n, struct request, rb_node); if (sector < blk_rq_pos(rq)) n = n->rb_left; else if (sector > blk_rq_pos(rq)) n = n->rb_right; else return rq; } return NULL; } EXPORT_SYMBOL(elv_rb_find); enum elv_merge elv_merge(struct request_queue *q, struct request **req, struct bio *bio) { struct elevator_queue *e = q->elevator; struct request *__rq; /* * Levels of merges: * nomerges: No merges at all attempted * noxmerges: Only simple one-hit cache try * merges: All merge tries attempted */ if (blk_queue_nomerges(q) || !bio_mergeable(bio)) return ELEVATOR_NO_MERGE; /* * First try one-hit cache. */ if (q->last_merge && elv_bio_merge_ok(q->last_merge, bio)) { enum elv_merge ret = blk_try_merge(q->last_merge, bio); if (ret != ELEVATOR_NO_MERGE) { *req = q->last_merge; return ret; } } if (blk_queue_noxmerges(q)) return ELEVATOR_NO_MERGE; /* * See if our hash lookup can find a potential backmerge. */ __rq = elv_rqhash_find(q, bio->bi_iter.bi_sector); if (__rq && elv_bio_merge_ok(__rq, bio)) { *req = __rq; if (blk_discard_mergable(__rq)) return ELEVATOR_DISCARD_MERGE; return ELEVATOR_BACK_MERGE; } if (e->type->ops.request_merge) return e->type->ops.request_merge(q, req, bio); return ELEVATOR_NO_MERGE; } /* * Attempt to do an insertion back merge. Only check for the case where * we can append 'rq' to an existing request, so we can throw 'rq' away * afterwards. * * Returns true if we merged, false otherwise. 'free' will contain all * requests that need to be freed. */ bool elv_attempt_insert_merge(struct request_queue *q, struct request *rq, struct list_head *free) { struct request *__rq; bool ret; if (blk_queue_nomerges(q)) return false; /* * First try one-hit cache. */ if (q->last_merge && blk_attempt_req_merge(q, q->last_merge, rq)) { list_add(&rq->queuelist, free); return true; } if (blk_queue_noxmerges(q)) return false; ret = false; /* * See if our hash lookup can find a potential backmerge. */ while (1) { __rq = elv_rqhash_find(q, blk_rq_pos(rq)); if (!__rq || !blk_attempt_req_merge(q, __rq, rq)) break; list_add(&rq->queuelist, free); /* The merged request could be merged with others, try again */ ret = true; rq = __rq; } return ret; } void elv_merged_request(struct request_queue *q, struct request *rq, enum elv_merge type) { struct elevator_queue *e = q->elevator; if (e->type->ops.request_merged) e->type->ops.request_merged(q, rq, type); if (type == ELEVATOR_BACK_MERGE) elv_rqhash_reposition(q, rq); q->last_merge = rq; } void elv_merge_requests(struct request_queue *q, struct request *rq, struct request *next) { struct elevator_queue *e = q->elevator; if (e->type->ops.requests_merged) e->type->ops.requests_merged(q, rq, next); elv_rqhash_reposition(q, rq); q->last_merge = rq; } struct request *elv_latter_request(struct request_queue *q, struct request *rq) { struct elevator_queue *e = q->elevator; if (e->type->ops.next_request) return e->type->ops.next_request(q, rq); return NULL; } struct request *elv_former_request(struct request_queue *q, struct request *rq) { struct elevator_queue *e = q->elevator; if (e->type->ops.former_request) return e->type->ops.former_request(q, rq); return NULL; } #define to_elv(atr) container_of_const((atr), struct elv_fs_entry, attr) static ssize_t elv_attr_show(struct kobject *kobj, struct attribute *attr, char *page) { const struct elv_fs_entry *entry = to_elv(attr); struct elevator_queue *e; ssize_t error = -ENODEV; if (!entry->show) return -EIO; e = container_of(kobj, struct elevator_queue, kobj); mutex_lock(&e->sysfs_lock); if (!test_bit(ELEVATOR_FLAG_DYING, &e->flags)) error = entry->show(e, page); mutex_unlock(&e->sysfs_lock); return error; } static ssize_t elv_attr_store(struct kobject *kobj, struct attribute *attr, const char *page, size_t length) { const struct elv_fs_entry *entry = to_elv(attr); struct elevator_queue *e; ssize_t error = -ENODEV; if (!entry->store) return -EIO; e = container_of(kobj, struct elevator_queue, kobj); mutex_lock(&e->sysfs_lock); if (!test_bit(ELEVATOR_FLAG_DYING, &e->flags)) error = entry->store(e, page, length); mutex_unlock(&e->sysfs_lock); return error; } static const struct sysfs_ops elv_sysfs_ops = { .show = elv_attr_show, .store = elv_attr_store, }; static const struct kobj_type elv_ktype = { .sysfs_ops = &elv_sysfs_ops, .release = elevator_release, }; static int elv_register_queue(struct request_queue *q, struct elevator_queue *e, bool uevent) { int error; error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched"); if (!error) { const struct elv_fs_entry *attr = e->type->elevator_attrs; if (attr) { while (attr->attr.name) { if (sysfs_create_file(&e->kobj, &attr->attr)) break; attr++; } } if (uevent) kobject_uevent(&e->kobj, KOBJ_ADD); /* * Sched is initialized, it is ready to export it via * debugfs */ blk_mq_sched_reg_debugfs(q); set_bit(ELEVATOR_FLAG_REGISTERED, &e->flags); } return error; } static void elv_unregister_queue(struct request_queue *q, struct elevator_queue *e) { if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) { kobject_uevent(&e->kobj, KOBJ_REMOVE); kobject_del(&e->kobj); /* unexport via debugfs before exiting sched */ blk_mq_sched_unreg_debugfs(q); } } int elv_register(struct elevator_type *e) { /* finish request is mandatory */ if (WARN_ON_ONCE(!e->ops.finish_request)) return -EINVAL; /* insert_requests and dispatch_request are mandatory */ if (WARN_ON_ONCE(!e->ops.insert_requests || !e->ops.dispatch_request)) return -EINVAL; /* create icq_cache if requested */ if (e->icq_size) { if (WARN_ON(e->icq_size < sizeof(struct io_cq)) || WARN_ON(e->icq_align < __alignof__(struct io_cq))) return -EINVAL; snprintf(e->icq_cache_name, sizeof(e->icq_cache_name), "%s_io_cq", e->elevator_name); e->icq_cache = kmem_cache_create(e->icq_cache_name, e->icq_size, e->icq_align, 0, NULL); if (!e->icq_cache) return -ENOMEM; } /* register, don't allow duplicate names */ spin_lock(&elv_list_lock); if (__elevator_find(e->elevator_name)) { spin_unlock(&elv_list_lock); kmem_cache_destroy(e->icq_cache); return -EBUSY; } list_add_tail(&e->list, &elv_list); spin_unlock(&elv_list_lock); printk(KERN_INFO "io scheduler %s registered\n", e->elevator_name); return 0; } EXPORT_SYMBOL_GPL(elv_register); void elv_unregister(struct elevator_type *e) { /* unregister */ spin_lock(&elv_list_lock); list_del_init(&e->list); spin_unlock(&elv_list_lock); /* * Destroy icq_cache if it exists. icq's are RCU managed. Make * sure all RCU operations are complete before proceeding. */ if (e->icq_cache) { rcu_barrier(); kmem_cache_destroy(e->icq_cache); e->icq_cache = NULL; } } EXPORT_SYMBOL_GPL(elv_unregister); /* * Switch to new_e io scheduler. * * If switching fails, we are most likely running out of memory and not able * to restore the old io scheduler, so leaving the io scheduler being none. */ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx) { struct elevator_type *new_e = NULL; int ret = 0; WARN_ON_ONCE(q->mq_freeze_depth == 0); lockdep_assert_held(&q->elevator_lock); if (strncmp(ctx->name, "none", 4)) { new_e = elevator_find_get(ctx->name); if (!new_e) return -EINVAL; } blk_mq_quiesce_queue(q); if (q->elevator) { ctx->old = q->elevator; elevator_exit(q); } if (new_e) { ret = blk_mq_init_sched(q, new_e, ctx->et); if (ret) goto out_unfreeze; ctx->new = q->elevator; } else { blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q); q->elevator = NULL; q->nr_requests = q->tag_set->queue_depth; } blk_add_trace_msg(q, "elv switch: %s", ctx->name); out_unfreeze: blk_mq_unquiesce_queue(q); if (ret) { pr_warn("elv: switch to \"%s\" failed, falling back to \"none\"\n", new_e->elevator_name); } if (new_e) elevator_put(new_e); return ret; } static void elv_exit_and_release(struct request_queue *q) { struct elevator_queue *e; unsigned memflags; memflags = blk_mq_freeze_queue(q); mutex_lock(&q->elevator_lock); e = q->elevator; elevator_exit(q); mutex_unlock(&q->elevator_lock); blk_mq_unfreeze_queue(q, memflags); if (e) { blk_mq_free_sched_tags(e->et, q->tag_set); kobject_put(&e->kobj); } } static int elevator_change_done(struct request_queue *q, struct elv_change_ctx *ctx) { int ret = 0; if (ctx->old) { bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT, &ctx->old->flags); elv_unregister_queue(q, ctx->old); blk_mq_free_sched_tags(ctx->old->et, q->tag_set); kobject_put(&ctx->old->kobj); if (enable_wbt) wbt_enable_default(q->disk); } if (ctx->new) { ret = elv_register_queue(q, ctx->new, !ctx->no_uevent); if (ret) elv_exit_and_release(q); } return ret; } /* * Switch this queue to the given IO scheduler. */ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx) { unsigned int memflags; struct blk_mq_tag_set *set = q->tag_set; int ret = 0; lockdep_assert_held(&set->update_nr_hwq_lock); if (strncmp(ctx->name, "none", 4)) { ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues); if (!ctx->et) return -ENOMEM; } memflags = blk_mq_freeze_queue(q); /* * May be called before adding disk, when there isn't any FS I/O, * so freezing queue plus canceling dispatch work is enough to * drain any dispatch activities originated from passthrough * requests, then no need to quiesce queue which may add long boot * latency, especially when lots of disks are involved. * * Disk isn't added yet, so verifying queue lock only manually. */ blk_mq_cancel_work_sync(q); mutex_lock(&q->elevator_lock); if (!(q->elevator && elevator_match(q->elevator->type, ctx->name))) ret = elevator_switch(q, ctx); mutex_unlock(&q->elevator_lock); blk_mq_unfreeze_queue(q, memflags); if (!ret) ret = elevator_change_done(q, ctx); /* * Free sched tags if it's allocated but we couldn't switch elevator. */ if (ctx->et && !ctx->new) blk_mq_free_sched_tags(ctx->et, set); return ret; } /* * The I/O scheduler depends on the number of hardware queues, this forces a * reattachment when nr_hw_queues changes. */ void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e, struct elevator_tags *t) { struct blk_mq_tag_set *set = q->tag_set; struct elv_change_ctx ctx = {}; int ret = -ENODEV; WARN_ON_ONCE(q->mq_freeze_depth == 0); if (e && !blk_queue_dying(q) && blk_queue_registered(q)) { ctx.name = e->elevator_name; ctx.et = t; mutex_lock(&q->elevator_lock); /* force to reattach elevator after nr_hw_queue is updated */ ret = elevator_switch(q, &ctx); mutex_unlock(&q->elevator_lock); } blk_mq_unfreeze_queue_nomemrestore(q); if (!ret) WARN_ON_ONCE(elevator_change_done(q, &ctx)); /* * Free sched tags if it's allocated but we couldn't switch elevator. */ if (t && !ctx.new) blk_mq_free_sched_tags(t, set); } /* * Use the default elevator settings. If the chosen elevator initialization * fails, fall back to the "none" elevator (no elevator). */ void elevator_set_default(struct request_queue *q) { struct elv_change_ctx ctx = { .name = "mq-deadline", .no_uevent = true, }; int err; struct elevator_type *e; /* now we allow to switch elevator */ blk_queue_flag_clear(QUEUE_FLAG_NO_ELV_SWITCH, q); if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT) return; /* * For single queue devices, default to using mq-deadline. If we * have multiple queues or mq-deadline is not available, default * to "none". */ e = elevator_find_get(ctx.name); if (!e) return; if ((q->nr_hw_queues == 1 || blk_mq_is_shared_tags(q->tag_set->flags))) { err = elevator_change(q, &ctx); if (err < 0) pr_warn("\"%s\" elevator initialization, failed %d, falling back to \"none\"\n", ctx.name, err); } elevator_put(e); } void elevator_set_none(struct request_queue *q) { struct elv_change_ctx ctx = { .name = "none", }; int err; err = elevator_change(q, &ctx); if (err < 0) pr_warn("%s: set none elevator failed %d\n", __func__, err); } static void elv_iosched_load_module(const char *elevator_name) { struct elevator_type *found; spin_lock(&elv_list_lock); found = __elevator_find(elevator_name); spin_unlock(&elv_list_lock); if (!found) request_module("%s-iosched", elevator_name); } ssize_t elv_iosched_store(struct gendisk *disk, const char *buf, size_t count) { char elevator_name[ELV_NAME_MAX]; struct elv_change_ctx ctx = {}; int ret; struct request_queue *q = disk->queue; struct blk_mq_tag_set *set = q->tag_set; /* Make sure queue is not in the middle of being removed */ if (!blk_queue_registered(q)) return -ENOENT; /* * If the attribute needs to load a module, do it before freezing the * queue to ensure that the module file can be read when the request * queue is the one for the device storing the module file. */ strscpy(elevator_name, buf, sizeof(elevator_name)); ctx.name = strstrip(elevator_name); elv_iosched_load_module(ctx.name); down_read(&set->update_nr_hwq_lock); if (!blk_queue_no_elv_switch(q)) { ret = elevator_change(q, &ctx); if (!ret) ret = count; } else { ret = -ENOENT; } up_read(&set->update_nr_hwq_lock); return ret; } ssize_t elv_iosched_show(struct gendisk *disk, char *name) { struct request_queue *q = disk->queue; struct elevator_type *cur = NULL, *e; int len = 0; mutex_lock(&q->elevator_lock); if (!q->elevator) { len += sprintf(name+len, "[none] "); } else { len += sprintf(name+len, "none "); cur = q->elevator->type; } spin_lock(&elv_list_lock); list_for_each_entry(e, &elv_list, list) { if (e == cur) len += sprintf(name+len, "[%s] ", e->elevator_name); else len += sprintf(name+len, "%s ", e->elevator_name); } spin_unlock(&elv_list_lock); len += sprintf(name+len, "\n"); mutex_unlock(&q->elevator_lock); return len; } struct request *elv_rb_former_request(struct request_queue *q, struct request *rq) { struct rb_node *rbprev = rb_prev(&rq->rb_node); if (rbprev) return rb_entry_rq(rbprev); return NULL; } EXPORT_SYMBOL(elv_rb_former_request); struct request *elv_rb_latter_request(struct request_queue *q, struct request *rq) { struct rb_node *rbnext = rb_next(&rq->rb_node); if (rbnext) return rb_entry_rq(rbnext); return NULL; } EXPORT_SYMBOL(elv_rb_latter_request); static int __init elevator_setup(char *str) { pr_warn("Kernel parameter elevator= does not have any effect anymore.\n" "Please use sysfs to set IO scheduler for individual devices.\n"); return 1; } __setup("elevator=", elevator_setup); |
| 137 436 2 6 448 453 452 29 1 2 27 4 2 11 1 18 26 26 13 6 3 1 1 1 3 1 2 3 5 1 4 1 3 1 1 1 1 19 2 18 489 462 36 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 | // SPDX-License-Identifier: GPL-2.0 #include <linux/mount.h> #include <linux/pseudo_fs.h> #include <linux/file.h> #include <linux/fs.h> #include <linux/proc_fs.h> #include <linux/proc_ns.h> #include <linux/magic.h> #include <linux/ktime.h> #include <linux/seq_file.h> #include <linux/pid_namespace.h> #include <linux/user_namespace.h> #include <linux/nsfs.h> #include <linux/uaccess.h> #include <linux/mnt_namespace.h> #include "mount.h" #include "internal.h" static struct vfsmount *nsfs_mnt; static long ns_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); static const struct file_operations ns_file_operations = { .unlocked_ioctl = ns_ioctl, .compat_ioctl = compat_ptr_ioctl, }; static char *ns_dname(struct dentry *dentry, char *buffer, int buflen) { struct inode *inode = d_inode(dentry); struct ns_common *ns = inode->i_private; const struct proc_ns_operations *ns_ops = ns->ops; return dynamic_dname(buffer, buflen, "%s:[%lu]", ns_ops->name, inode->i_ino); } const struct dentry_operations ns_dentry_operations = { .d_dname = ns_dname, .d_prune = stashed_dentry_prune, }; static void nsfs_evict(struct inode *inode) { struct ns_common *ns = inode->i_private; clear_inode(inode); ns->ops->put(ns); } int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb, void *private_data) { struct ns_common *ns; ns = ns_get_cb(private_data); if (!ns) return -ENOENT; return path_from_stashed(&ns->stashed, nsfs_mnt, ns, path); } struct ns_get_path_task_args { const struct proc_ns_operations *ns_ops; struct task_struct *task; }; static struct ns_common *ns_get_path_task(void *private_data) { struct ns_get_path_task_args *args = private_data; return args->ns_ops->get(args->task); } int ns_get_path(struct path *path, struct task_struct *task, const struct proc_ns_operations *ns_ops) { struct ns_get_path_task_args args = { .ns_ops = ns_ops, .task = task, }; return ns_get_path_cb(path, ns_get_path_task, &args); } /** * open_namespace - open a namespace * @ns: the namespace to open * * This will consume a reference to @ns indendent of success or failure. * * Return: A file descriptor on success or a negative error code on failure. */ int open_namespace(struct ns_common *ns) { struct path path __free(path_put) = {}; struct file *f; int err; /* call first to consume reference */ err = path_from_stashed(&ns->stashed, nsfs_mnt, ns, &path); if (err < 0) return err; CLASS(get_unused_fd, fd)(O_CLOEXEC); if (fd < 0) return fd; f = dentry_open(&path, O_RDONLY, current_cred()); if (IS_ERR(f)) return PTR_ERR(f); fd_install(fd, f); return take_fd(fd); } int open_related_ns(struct ns_common *ns, struct ns_common *(*get_ns)(struct ns_common *ns)) { struct ns_common *relative; relative = get_ns(ns); if (IS_ERR(relative)) return PTR_ERR(relative); return open_namespace(relative); } EXPORT_SYMBOL_GPL(open_related_ns); static int copy_ns_info_to_user(const struct mnt_namespace *mnt_ns, struct mnt_ns_info __user *uinfo, size_t usize, struct mnt_ns_info *kinfo) { /* * If userspace and the kernel have the same struct size it can just * be copied. If userspace provides an older struct, only the bits that * userspace knows about will be copied. If userspace provides a new * struct, only the bits that the kernel knows aobut will be copied and * the size value will be set to the size the kernel knows about. */ kinfo->size = min(usize, sizeof(*kinfo)); kinfo->mnt_ns_id = mnt_ns->seq; kinfo->nr_mounts = READ_ONCE(mnt_ns->nr_mounts); /* Subtract the root mount of the mount namespace. */ if (kinfo->nr_mounts) kinfo->nr_mounts--; if (copy_to_user(uinfo, kinfo, kinfo->size)) return -EFAULT; return 0; } static bool nsfs_ioctl_valid(unsigned int cmd) { switch (cmd) { case NS_GET_USERNS: case NS_GET_PARENT: case NS_GET_NSTYPE: case NS_GET_OWNER_UID: case NS_GET_MNTNS_ID: case NS_GET_PID_FROM_PIDNS: case NS_GET_TGID_FROM_PIDNS: case NS_GET_PID_IN_PIDNS: case NS_GET_TGID_IN_PIDNS: return (_IOC_TYPE(cmd) == _IOC_TYPE(cmd)); } /* Extensible ioctls require some extra handling. */ switch (_IOC_NR(cmd)) { case _IOC_NR(NS_MNT_GET_INFO): case _IOC_NR(NS_MNT_GET_NEXT): case _IOC_NR(NS_MNT_GET_PREV): return (_IOC_TYPE(cmd) == _IOC_TYPE(cmd)); } return false; } static long ns_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { struct user_namespace *user_ns; struct pid_namespace *pid_ns; struct task_struct *tsk; struct ns_common *ns; struct mnt_namespace *mnt_ns; bool previous = false; uid_t __user *argp; uid_t uid; int ret; if (!nsfs_ioctl_valid(ioctl)) return -ENOIOCTLCMD; ns = get_proc_ns(file_inode(filp)); switch (ioctl) { case NS_GET_USERNS: return open_related_ns(ns, ns_get_owner); case NS_GET_PARENT: if (!ns->ops->get_parent) return -EINVAL; return open_related_ns(ns, ns->ops->get_parent); case NS_GET_NSTYPE: return ns->ops->type; case NS_GET_OWNER_UID: if (ns->ops->type != CLONE_NEWUSER) return -EINVAL; user_ns = container_of(ns, struct user_namespace, ns); argp = (uid_t __user *) arg; uid = from_kuid_munged(current_user_ns(), user_ns->owner); return put_user(uid, argp); case NS_GET_MNTNS_ID: { __u64 __user *idp; __u64 id; if (ns->ops->type != CLONE_NEWNS) return -EINVAL; mnt_ns = container_of(ns, struct mnt_namespace, ns); idp = (__u64 __user *)arg; id = mnt_ns->seq; return put_user(id, idp); } case NS_GET_PID_FROM_PIDNS: fallthrough; case NS_GET_TGID_FROM_PIDNS: fallthrough; case NS_GET_PID_IN_PIDNS: fallthrough; case NS_GET_TGID_IN_PIDNS: { if (ns->ops->type != CLONE_NEWPID) return -EINVAL; ret = -ESRCH; pid_ns = container_of(ns, struct pid_namespace, ns); guard(rcu)(); if (ioctl == NS_GET_PID_IN_PIDNS || ioctl == NS_GET_TGID_IN_PIDNS) tsk = find_task_by_vpid(arg); else tsk = find_task_by_pid_ns(arg, pid_ns); if (!tsk) break; switch (ioctl) { case NS_GET_PID_FROM_PIDNS: ret = task_pid_vnr(tsk); break; case NS_GET_TGID_FROM_PIDNS: ret = task_tgid_vnr(tsk); break; case NS_GET_PID_IN_PIDNS: ret = task_pid_nr_ns(tsk, pid_ns); break; case NS_GET_TGID_IN_PIDNS: ret = task_tgid_nr_ns(tsk, pid_ns); break; default: ret = 0; break; } if (!ret) ret = -ESRCH; return ret; } } /* extensible ioctls */ switch (_IOC_NR(ioctl)) { case _IOC_NR(NS_MNT_GET_INFO): { struct mnt_ns_info kinfo = {}; struct mnt_ns_info __user *uinfo = (struct mnt_ns_info __user *)arg; size_t usize = _IOC_SIZE(ioctl); if (ns->ops->type != CLONE_NEWNS) return -EINVAL; if (!uinfo) return -EINVAL; if (usize < MNT_NS_INFO_SIZE_VER0) return -EINVAL; return copy_ns_info_to_user(to_mnt_ns(ns), uinfo, usize, &kinfo); } case _IOC_NR(NS_MNT_GET_PREV): previous = true; fallthrough; case _IOC_NR(NS_MNT_GET_NEXT): { struct mnt_ns_info kinfo = {}; struct mnt_ns_info __user *uinfo = (struct mnt_ns_info __user *)arg; struct path path __free(path_put) = {}; struct file *f __free(fput) = NULL; size_t usize = _IOC_SIZE(ioctl); if (ns->ops->type != CLONE_NEWNS) return -EINVAL; if (usize < MNT_NS_INFO_SIZE_VER0) return -EINVAL; mnt_ns = get_sequential_mnt_ns(to_mnt_ns(ns), previous); if (IS_ERR(mnt_ns)) return PTR_ERR(mnt_ns); ns = to_ns_common(mnt_ns); /* Transfer ownership of @mnt_ns reference to @path. */ ret = path_from_stashed(&ns->stashed, nsfs_mnt, ns, &path); if (ret) return ret; CLASS(get_unused_fd, fd)(O_CLOEXEC); if (fd < 0) return fd; f = dentry_open(&path, O_RDONLY, current_cred()); if (IS_ERR(f)) return PTR_ERR(f); if (uinfo) { /* * If @uinfo is passed return all information about the * mount namespace as well. */ ret = copy_ns_info_to_user(to_mnt_ns(ns), uinfo, usize, &kinfo); if (ret) return ret; } /* Transfer reference of @f to caller's fdtable. */ fd_install(fd, no_free_ptr(f)); /* File descriptor is live so hand it off to the caller. */ return take_fd(fd); } default: ret = -ENOTTY; } return ret; } int ns_get_name(char *buf, size_t size, struct task_struct *task, const struct proc_ns_operations *ns_ops) { struct ns_common *ns; int res = -ENOENT; const char *name; ns = ns_ops->get(task); if (ns) { name = ns_ops->real_ns_name ? : ns_ops->name; res = snprintf(buf, size, "%s:[%u]", name, ns->inum); ns_ops->put(ns); } return res; } bool proc_ns_file(const struct file *file) { return file->f_op == &ns_file_operations; } /** * ns_match() - Returns true if current namespace matches dev/ino provided. * @ns: current namespace * @dev: dev_t from nsfs that will be matched against current nsfs * @ino: ino_t from nsfs that will be matched against current nsfs * * Return: true if dev and ino matches the current nsfs. */ bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) { return (ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev); } static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) { struct inode *inode = d_inode(dentry); const struct ns_common *ns = inode->i_private; const struct proc_ns_operations *ns_ops = ns->ops; seq_printf(seq, "%s:[%lu]", ns_ops->name, inode->i_ino); return 0; } static const struct super_operations nsfs_ops = { .statfs = simple_statfs, .evict_inode = nsfs_evict, .show_path = nsfs_show_path, }; static int nsfs_init_inode(struct inode *inode, void *data) { struct ns_common *ns = data; inode->i_private = data; inode->i_mode |= S_IRUGO; inode->i_fop = &ns_file_operations; inode->i_ino = ns->inum; return 0; } static void nsfs_put_data(void *data) { struct ns_common *ns = data; ns->ops->put(ns); } static const struct stashed_operations nsfs_stashed_ops = { .init_inode = nsfs_init_inode, .put_data = nsfs_put_data, }; static int nsfs_init_fs_context(struct fs_context *fc) { struct pseudo_fs_context *ctx = init_pseudo(fc, NSFS_MAGIC); if (!ctx) return -ENOMEM; ctx->ops = &nsfs_ops; ctx->dops = &ns_dentry_operations; fc->s_fs_info = (void *)&nsfs_stashed_ops; return 0; } static struct file_system_type nsfs = { .name = "nsfs", .init_fs_context = nsfs_init_fs_context, .kill_sb = kill_anon_super, }; void __init nsfs_init(void) { nsfs_mnt = kern_mount(&nsfs); if (IS_ERR(nsfs_mnt)) panic("can't set nsfs up\n"); nsfs_mnt->mnt_sb->s_flags &= ~SB_NOUSER; } |
| 9 112 2 2 2 112 112 112 112 112 112 112 112 112 8 9 9 9 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 | /* * Copyright (c) 2005 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/module.h> #include <linux/init.h> #include <linux/slab.h> #include <linux/err.h> #include <linux/string.h> #include <linux/parser.h> #include <linux/random.h> #include <linux/jiffies.h> #include <linux/lockdep.h> #include <linux/inet.h> #include <rdma/ib_cache.h> #include <linux/atomic.h> #include <scsi/scsi.h> #include <scsi/scsi_device.h> #include <scsi/scsi_dbg.h> #include <scsi/scsi_tcq.h> #include <scsi/srp.h> #include <scsi/scsi_transport_srp.h> #include "ib_srp.h" #define DRV_NAME "ib_srp" #define PFX DRV_NAME ": " MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol initiator"); MODULE_LICENSE("Dual BSD/GPL"); static unsigned int srp_sg_tablesize; static unsigned int cmd_sg_entries; static unsigned int indirect_sg_entries; static bool allow_ext_sg; static bool register_always = true; static bool never_register; static int topspin_workarounds = 1; module_param(srp_sg_tablesize, uint, 0444); MODULE_PARM_DESC(srp_sg_tablesize, "Deprecated name for cmd_sg_entries"); module_param(cmd_sg_entries, uint, 0444); MODULE_PARM_DESC(cmd_sg_entries, "Default number of gather/scatter entries in the SRP command (default is 12, max 255)"); module_param(indirect_sg_entries, uint, 0444); MODULE_PARM_DESC(indirect_sg_entries, "Default max number of gather/scatter entries (default is 12, max is " __stringify(SG_MAX_SEGMENTS) ")"); module_param(allow_ext_sg, bool, 0444); MODULE_PARM_DESC(allow_ext_sg, "Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false)"); module_param(topspin_workarounds, int, 0444); MODULE_PARM_DESC(topspin_workarounds, "Enable workarounds for Topspin/Cisco SRP target bugs if != 0"); module_param(register_always, bool, 0444); MODULE_PARM_DESC(register_always, "Use memory registration even for contiguous memory regions"); module_param(never_register, bool, 0444); MODULE_PARM_DESC(never_register, "Never register memory"); static const struct kernel_param_ops srp_tmo_ops; static int srp_reconnect_delay = 10; module_param_cb(reconnect_delay, &srp_tmo_ops, &srp_reconnect_delay, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(reconnect_delay, "Time between successive reconnect attempts"); static int srp_fast_io_fail_tmo = 15; module_param_cb(fast_io_fail_tmo, &srp_tmo_ops, &srp_fast_io_fail_tmo, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(fast_io_fail_tmo, "Number of seconds between the observation of a transport" " layer error and failing all I/O. \"off\" means that this" " functionality is disabled."); static int srp_dev_loss_tmo = 600; module_param_cb(dev_loss_tmo, &srp_tmo_ops, &srp_dev_loss_tmo, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(dev_loss_tmo, "Maximum number of seconds that the SRP transport should" " insulate transport layer errors. After this time has been" " exceeded the SCSI host is removed. Should be" " between 1 and " __stringify(SCSI_DEVICE_BLOCK_MAX_TIMEOUT) " if fast_io_fail_tmo has not been set. \"off\" means that" " this functionality is disabled."); static bool srp_use_imm_data = true; module_param_named(use_imm_data, srp_use_imm_data, bool, 0644); MODULE_PARM_DESC(use_imm_data, "Whether or not to request permission to use immediate data during SRP login."); static unsigned int srp_max_imm_data = 8 * 1024; module_param_named(max_imm_data, srp_max_imm_data, uint, 0644); MODULE_PARM_DESC(max_imm_data, "Maximum immediate data size."); static unsigned ch_count; module_param(ch_count, uint, 0444); MODULE_PARM_DESC(ch_count, "Number of RDMA channels to use for communication with an SRP target. Using more than one channel improves performance if the HCA supports multiple completion vectors. The default value is the minimum of four times the number of online CPU sockets and the number of completion vectors supported by the HCA."); static int srp_add_one(struct ib_device *device); static void srp_remove_one(struct ib_device *device, void *client_data); static void srp_rename_dev(struct ib_device *device, void *client_data); static void srp_recv_done(struct ib_cq *cq, struct ib_wc *wc); static void srp_handle_qp_err(struct ib_cq *cq, struct ib_wc *wc, const char *opname); static int srp_ib_cm_handler(struct ib_cm_id *cm_id, const struct ib_cm_event *event); static int srp_rdma_cm_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event); static struct scsi_transport_template *ib_srp_transport_template; static struct workqueue_struct *srp_remove_wq; static struct ib_client srp_client = { .name = "srp", .add = srp_add_one, .remove = srp_remove_one, .rename = srp_rename_dev }; static struct ib_sa_client srp_sa_client; static int srp_tmo_get(char *buffer, const struct kernel_param *kp) { int tmo = *(int *)kp->arg; if (tmo >= 0) return sysfs_emit(buffer, "%d\n", tmo); else return sysfs_emit(buffer, "off\n"); } static int srp_tmo_set(const char *val, const struct kernel_param *kp) { int tmo, res; res = srp_parse_tmo(&tmo, val); if (res) goto out; if (kp->arg == &srp_reconnect_delay) res = srp_tmo_valid(tmo, srp_fast_io_fail_tmo, srp_dev_loss_tmo); else if (kp->arg == &srp_fast_io_fail_tmo) res = srp_tmo_valid(srp_reconnect_delay, tmo, srp_dev_loss_tmo); else res = srp_tmo_valid(srp_reconnect_delay, srp_fast_io_fail_tmo, tmo); if (res) goto out; *(int *)kp->arg = tmo; out: return res; } static const struct kernel_param_ops srp_tmo_ops = { .get = srp_tmo_get, .set = srp_tmo_set, }; static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) { return (struct srp_target_port *) host->hostdata; } static const char *srp_target_info(struct Scsi_Host *host) { return host_to_target(host)->target_name; } static int srp_target_is_topspin(struct srp_target_port *target) { static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; static const u8 cisco_oui[3] = { 0x00, 0x1b, 0x0d }; return topspin_workarounds && (!memcmp(&target->ioc_guid, topspin_oui, sizeof topspin_oui) || !memcmp(&target->ioc_guid, cisco_oui, sizeof cisco_oui)); } static struct srp_iu *srp_alloc_iu(struct srp_host *host, size_t size, gfp_t gfp_mask, enum dma_data_direction direction) { struct srp_iu *iu; iu = kmalloc(sizeof *iu, gfp_mask); if (!iu) goto out; iu->buf = kzalloc(size, gfp_mask); if (!iu->buf) goto out_free_iu; iu->dma = ib_dma_map_single(host->srp_dev->dev, iu->buf, size, direction); if (ib_dma_mapping_error(host->srp_dev->dev, iu->dma)) goto out_free_buf; iu->size = size; iu->direction = direction; return iu; out_free_buf: kfree(iu->buf); out_free_iu: kfree(iu); out: return NULL; } static void srp_free_iu(struct srp_host *host, struct srp_iu *iu) { if (!iu) return; ib_dma_unmap_single(host->srp_dev->dev, iu->dma, iu->size, iu->direction); kfree(iu->buf); kfree(iu); } static void srp_qp_event(struct ib_event *event, void *context) { pr_debug("QP event %s (%d)\n", ib_event_msg(event->event), event->event); } static int srp_init_ib_qp(struct srp_target_port *target, struct ib_qp *qp) { struct ib_qp_attr *attr; int ret; attr = kmalloc(sizeof *attr, GFP_KERNEL); if (!attr) return -ENOMEM; ret = ib_find_cached_pkey(target->srp_host->srp_dev->dev, target->srp_host->port, be16_to_cpu(target->ib_cm.pkey), &attr->pkey_index); if (ret) goto out; attr->qp_state = IB_QPS_INIT; attr->qp_access_flags = (IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE); attr->port_num = target->srp_host->port; ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_ACCESS_FLAGS | IB_QP_PORT); out: kfree(attr); return ret; } static int srp_new_ib_cm_id(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; struct ib_cm_id *new_cm_id; new_cm_id = ib_create_cm_id(target->srp_host->srp_dev->dev, srp_ib_cm_handler, ch); if (IS_ERR(new_cm_id)) return PTR_ERR(new_cm_id); if (ch->ib_cm.cm_id) ib_destroy_cm_id(ch->ib_cm.cm_id); ch->ib_cm.cm_id = new_cm_id; if (rdma_cap_opa_ah(target->srp_host->srp_dev->dev, target->srp_host->port)) ch->ib_cm.path.rec_type = SA_PATH_REC_TYPE_OPA; else ch->ib_cm.path.rec_type = SA_PATH_REC_TYPE_IB; ch->ib_cm.path.sgid = target->sgid; ch->ib_cm.path.dgid = target->ib_cm.orig_dgid; ch->ib_cm.path.pkey = target->ib_cm.pkey; ch->ib_cm.path.service_id = target->ib_cm.service_id; return 0; } static int srp_new_rdma_cm_id(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; struct rdma_cm_id *new_cm_id; int ret; new_cm_id = rdma_create_id(target->net, srp_rdma_cm_handler, ch, RDMA_PS_TCP, IB_QPT_RC); if (IS_ERR(new_cm_id)) { ret = PTR_ERR(new_cm_id); new_cm_id = NULL; goto out; } init_completion(&ch->done); ret = rdma_resolve_addr(new_cm_id, target->rdma_cm.src_specified ? &target->rdma_cm.src.sa : NULL, &target->rdma_cm.dst.sa, SRP_PATH_REC_TIMEOUT_MS); if (ret) { pr_err("No route available from %pISpsc to %pISpsc (%d)\n", &target->rdma_cm.src, &target->rdma_cm.dst, ret); goto out; } ret = wait_for_completion_interruptible(&ch->done); if (ret < 0) goto out; ret = ch->status; if (ret) { pr_err("Resolving address %pISpsc failed (%d)\n", &target->rdma_cm.dst, ret); goto out; } swap(ch->rdma_cm.cm_id, new_cm_id); out: if (new_cm_id) rdma_destroy_id(new_cm_id); return ret; } static int srp_new_cm_id(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; return target->using_rdma_cm ? srp_new_rdma_cm_id(ch) : srp_new_ib_cm_id(ch); } /** * srp_destroy_fr_pool() - free the resources owned by a pool * @pool: Fast registration pool to be destroyed. */ static void srp_destroy_fr_pool(struct srp_fr_pool *pool) { int i; struct srp_fr_desc *d; if (!pool) return; for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { if (d->mr) ib_dereg_mr(d->mr); } kfree(pool); } /** * srp_create_fr_pool() - allocate and initialize a pool for fast registration * @device: IB device to allocate fast registration descriptors for. * @pd: Protection domain associated with the FR descriptors. * @pool_size: Number of descriptors to allocate. * @max_page_list_len: Maximum fast registration work request page list length. */ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device, struct ib_pd *pd, int pool_size, int max_page_list_len) { struct srp_fr_pool *pool; struct srp_fr_desc *d; struct ib_mr *mr; int i, ret = -EINVAL; enum ib_mr_type mr_type; if (pool_size <= 0) goto err; ret = -ENOMEM; pool = kzalloc(struct_size(pool, desc, pool_size), GFP_KERNEL); if (!pool) goto err; pool->size = pool_size; pool->max_page_list_len = max_page_list_len; spin_lock_init(&pool->lock); INIT_LIST_HEAD(&pool->free_list); if (device->attrs.kernel_cap_flags & IBK_SG_GAPS_REG) mr_type = IB_MR_TYPE_SG_GAPS; else mr_type = IB_MR_TYPE_MEM_REG; for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) { mr = ib_alloc_mr(pd, mr_type, max_page_list_len); if (IS_ERR(mr)) { ret = PTR_ERR(mr); if (ret == -ENOMEM) pr_info("%s: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count\n", dev_name(&device->dev)); goto destroy_pool; } d->mr = mr; list_add_tail(&d->entry, &pool->free_list); } out: return pool; destroy_pool: srp_destroy_fr_pool(pool); err: pool = ERR_PTR(ret); goto out; } /** * srp_fr_pool_get() - obtain a descriptor suitable for fast registration * @pool: Pool to obtain descriptor from. */ static struct srp_fr_desc *srp_fr_pool_get(struct srp_fr_pool *pool) { struct srp_fr_desc *d = NULL; unsigned long flags; spin_lock_irqsave(&pool->lock, flags); if (!list_empty(&pool->free_list)) { d = list_first_entry(&pool->free_list, typeof(*d), entry); list_del(&d->entry); } spin_unlock_irqrestore(&pool->lock, flags); return d; } /** * srp_fr_pool_put() - put an FR descriptor back in the free list * @pool: Pool the descriptor was allocated from. * @desc: Pointer to an array of fast registration descriptor pointers. * @n: Number of descriptors to put back. * * Note: The caller must already have queued an invalidation request for * desc->mr->rkey before calling this function. */ static void srp_fr_pool_put(struct srp_fr_pool *pool, struct srp_fr_desc **desc, int n) { unsigned long flags; int i; spin_lock_irqsave(&pool->lock, flags); for (i = 0; i < n; i++) list_add(&desc[i]->entry, &pool->free_list); spin_unlock_irqrestore(&pool->lock, flags); } static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target) { struct srp_device *dev = target->srp_host->srp_dev; return srp_create_fr_pool(dev->dev, dev->pd, target->mr_pool_size, dev->max_pages_per_mr); } /** * srp_destroy_qp() - destroy an RDMA queue pair * @ch: SRP RDMA channel. * * Drain the qp before destroying it. This avoids that the receive * completion handler can access the queue pair while it is * being destroyed. */ static void srp_destroy_qp(struct srp_rdma_ch *ch) { spin_lock_irq(&ch->lock); ib_process_cq_direct(ch->send_cq, -1); spin_unlock_irq(&ch->lock); ib_drain_qp(ch->qp); ib_destroy_qp(ch->qp); } static int srp_create_ch_ib(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; struct srp_device *dev = target->srp_host->srp_dev; const struct ib_device_attr *attr = &dev->dev->attrs; struct ib_qp_init_attr *init_attr; struct ib_cq *recv_cq, *send_cq; struct ib_qp *qp; struct srp_fr_pool *fr_pool = NULL; const int m = 1 + dev->use_fast_reg * target->mr_per_cmd * 2; int ret; init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); if (!init_attr) return -ENOMEM; /* queue_size + 1 for ib_drain_rq() */ recv_cq = ib_alloc_cq(dev->dev, ch, target->queue_size + 1, ch->comp_vector, IB_POLL_SOFTIRQ); if (IS_ERR(recv_cq)) { ret = PTR_ERR(recv_cq); goto err; } send_cq = ib_alloc_cq(dev->dev, ch, m * target->queue_size, ch->comp_vector, IB_POLL_DIRECT); if (IS_ERR(send_cq)) { ret = PTR_ERR(send_cq); goto err_recv_cq; } init_attr->event_handler = srp_qp_event; init_attr->cap.max_send_wr = m * target->queue_size; init_attr->cap.max_recv_wr = target->queue_size + 1; init_attr->cap.max_recv_sge = 1; init_attr->cap.max_send_sge = min(SRP_MAX_SGE, attr->max_send_sge); init_attr->sq_sig_type = IB_SIGNAL_REQ_WR; init_attr->qp_type = IB_QPT_RC; init_attr->send_cq = send_cq; init_attr->recv_cq = recv_cq; ch->max_imm_sge = min(init_attr->cap.max_send_sge - 1U, 255U); if (target->using_rdma_cm) { ret = rdma_create_qp(ch->rdma_cm.cm_id, dev->pd, init_attr); qp = ch->rdma_cm.cm_id->qp; } else { qp = ib_create_qp(dev->pd, init_attr); if (!IS_ERR(qp)) { ret = srp_init_ib_qp(target, qp); if (ret) ib_destroy_qp(qp); } else { ret = PTR_ERR(qp); } } if (ret) { pr_err("QP creation failed for dev %s: %d\n", dev_name(&dev->dev->dev), ret); goto err_send_cq; } if (dev->use_fast_reg) { fr_pool = srp_alloc_fr_pool(target); if (IS_ERR(fr_pool)) { ret = PTR_ERR(fr_pool); shost_printk(KERN_WARNING, target->scsi_host, PFX "FR pool allocation failed (%d)\n", ret); goto err_qp; } } if (ch->qp) srp_destroy_qp(ch); if (ch->recv_cq) ib_free_cq(ch->recv_cq); if (ch->send_cq) ib_free_cq(ch->send_cq); ch->qp = qp; ch->recv_cq = recv_cq; ch->send_cq = send_cq; if (dev->use_fast_reg) { if (ch->fr_pool) srp_destroy_fr_pool(ch->fr_pool); ch->fr_pool = fr_pool; } kfree(init_attr); return 0; err_qp: if (target->using_rdma_cm) rdma_destroy_qp(ch->rdma_cm.cm_id); else ib_destroy_qp(qp); err_send_cq: ib_free_cq(send_cq); err_recv_cq: ib_free_cq(recv_cq); err: kfree(init_attr); return ret; } /* * Note: this function may be called without srp_alloc_iu_bufs() having been * invoked. Hence the ch->[rt]x_ring checks. */ static void srp_free_ch_ib(struct srp_target_port *target, struct srp_rdma_ch *ch) { struct srp_device *dev = target->srp_host->srp_dev; int i; if (!ch->target) return; if (target->using_rdma_cm) { if (ch->rdma_cm.cm_id) { rdma_destroy_id(ch->rdma_cm.cm_id); ch->rdma_cm.cm_id = NULL; } } else { if (ch->ib_cm.cm_id) { ib_destroy_cm_id(ch->ib_cm.cm_id); ch->ib_cm.cm_id = NULL; } } /* If srp_new_cm_id() succeeded but srp_create_ch_ib() not, return. */ if (!ch->qp) return; if (dev->use_fast_reg) { if (ch->fr_pool) srp_destroy_fr_pool(ch->fr_pool); } srp_destroy_qp(ch); ib_free_cq(ch->send_cq); ib_free_cq(ch->recv_cq); /* * Avoid that the SCSI error handler tries to use this channel after * it has been freed. The SCSI error handler can namely continue * trying to perform recovery actions after scsi_remove_host() * returned. */ ch->target = NULL; ch->qp = NULL; ch->send_cq = ch->recv_cq = NULL; if (ch->rx_ring) { for (i = 0; i < target->queue_size; ++i) srp_free_iu(target->srp_host, ch->rx_ring[i]); kfree(ch->rx_ring); ch->rx_ring = NULL; } if (ch->tx_ring) { for (i = 0; i < target->queue_size; ++i) srp_free_iu(target->srp_host, ch->tx_ring[i]); kfree(ch->tx_ring); ch->tx_ring = NULL; } } static void srp_path_rec_completion(int status, struct sa_path_rec *pathrec, unsigned int num_paths, void *ch_ptr) { struct srp_rdma_ch *ch = ch_ptr; struct srp_target_port *target = ch->target; ch->status = status; if (status) shost_printk(KERN_ERR, target->scsi_host, PFX "Got failed path rec status %d\n", status); else ch->ib_cm.path = *pathrec; complete(&ch->done); } static int srp_ib_lookup_path(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; int ret; ch->ib_cm.path.numb_path = 1; init_completion(&ch->done); ch->ib_cm.path_query_id = ib_sa_path_rec_get(&srp_sa_client, target->srp_host->srp_dev->dev, target->srp_host->port, &ch->ib_cm.path, IB_SA_PATH_REC_SERVICE_ID | IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY, SRP_PATH_REC_TIMEOUT_MS, GFP_KERNEL, srp_path_rec_completion, ch, &ch->ib_cm.path_query); if (ch->ib_cm.path_query_id < 0) return ch->ib_cm.path_query_id; ret = wait_for_completion_interruptible(&ch->done); if (ret < 0) return ret; if (ch->status < 0) shost_printk(KERN_WARNING, target->scsi_host, PFX "Path record query failed: sgid %pI6, dgid %pI6, pkey %#04x, service_id %#16llx\n", ch->ib_cm.path.sgid.raw, ch->ib_cm.path.dgid.raw, be16_to_cpu(target->ib_cm.pkey), be64_to_cpu(target->ib_cm.service_id)); return ch->status; } static int srp_rdma_lookup_path(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; int ret; init_completion(&ch->done); ret = rdma_resolve_route(ch->rdma_cm.cm_id, SRP_PATH_REC_TIMEOUT_MS); if (ret) return ret; wait_for_completion_interruptible(&ch->done); if (ch->status != 0) shost_printk(KERN_WARNING, target->scsi_host, PFX "Path resolution failed\n"); return ch->status; } static int srp_lookup_path(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; return target->using_rdma_cm ? srp_rdma_lookup_path(ch) : srp_ib_lookup_path(ch); } static u8 srp_get_subnet_timeout(struct srp_host *host) { struct ib_port_attr attr; int ret; u8 subnet_timeout = 18; ret = ib_query_port(host->srp_dev->dev, host->port, &attr); if (ret == 0) subnet_timeout = attr.subnet_timeout; if (unlikely(subnet_timeout < 15)) pr_warn("%s: subnet timeout %d may cause SRP login to fail.\n", dev_name(&host->srp_dev->dev->dev), subnet_timeout); return subnet_timeout; } static int srp_send_req(struct srp_rdma_ch *ch, uint32_t max_iu_len, bool multich) { struct srp_target_port *target = ch->target; struct { struct rdma_conn_param rdma_param; struct srp_login_req_rdma rdma_req; struct ib_cm_req_param ib_param; struct srp_login_req ib_req; } *req = NULL; char *ipi, *tpi; int status; req = kzalloc(sizeof *req, GFP_KERNEL); if (!req) return -ENOMEM; req->ib_param.flow_control = 1; req->ib_param.retry_count = target->tl_retry_count; /* * Pick some arbitrary defaults here; we could make these * module parameters if anyone cared about setting them. */ req->ib_param.responder_resources = 4; req->ib_param.rnr_retry_count = 7; req->ib_param.max_cm_retries = 15; req->ib_req.opcode = SRP_LOGIN_REQ; req->ib_req.tag = 0; req->ib_req.req_it_iu_len = cpu_to_be32(max_iu_len); req->ib_req.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | SRP_BUF_FORMAT_INDIRECT); req->ib_req.req_flags = (multich ? SRP_MULTICHAN_MULTI : SRP_MULTICHAN_SINGLE); if (srp_use_imm_data) { req->ib_req.req_flags |= SRP_IMMED_REQUESTED; req->ib_req.imm_data_offset = cpu_to_be16(SRP_IMM_DATA_OFFSET); } if (target->using_rdma_cm) { req->rdma_param.flow_control = req->ib_param.flow_control; req->rdma_param.responder_resources = req->ib_param.responder_resources; req->rdma_param.initiator_depth = req->ib_param.initiator_depth; req->rdma_param.retry_count = req->ib_param.retry_count; req->rdma_param.rnr_retry_count = req->ib_param.rnr_retry_count; req->rdma_param.private_data = &req->rdma_req; req->rdma_param.private_data_len = sizeof(req->rdma_req); req->rdma_req.opcode = req->ib_req.opcode; req->rdma_req.tag = req->ib_req.tag; req->rdma_req.req_it_iu_len = req->ib_req.req_it_iu_len; req->rdma_req.req_buf_fmt = req->ib_req.req_buf_fmt; req->rdma_req.req_flags = req->ib_req.req_flags; req->rdma_req.imm_data_offset = req->ib_req.imm_data_offset; ipi = req->rdma_req.initiator_port_id; tpi = req->rdma_req.target_port_id; } else { u8 subnet_timeout; subnet_timeout = srp_get_subnet_timeout(target->srp_host); req->ib_param.primary_path = &ch->ib_cm.path; req->ib_param.alternate_path = NULL; req->ib_param.service_id = target->ib_cm.service_id; get_random_bytes(&req->ib_param.starting_psn, 4); req->ib_param.starting_psn &= 0xffffff; req->ib_param.qp_num = ch->qp->qp_num; req->ib_param.qp_type = ch->qp->qp_type; req->ib_param.local_cm_response_timeout = subnet_timeout + 2; req->ib_param.remote_cm_response_timeout = subnet_timeout + 2; req->ib_param.private_data = &req->ib_req; req->ib_param.private_data_len = sizeof(req->ib_req); ipi = req->ib_req.initiator_port_id; tpi = req->ib_req.target_port_id; } /* * In the published SRP specification (draft rev. 16a), the * port identifier format is 8 bytes of ID extension followed * by 8 bytes of GUID. Older drafts put the two halves in the * opposite order, so that the GUID comes first. * * Targets conforming to these obsolete drafts can be * recognized by the I/O Class they report. */ if (target->io_class == SRP_REV10_IB_IO_CLASS) { memcpy(ipi, &target->sgid.global.interface_id, 8); memcpy(ipi + 8, &target->initiator_ext, 8); memcpy(tpi, &target->ioc_guid, 8); memcpy(tpi + 8, &target->id_ext, 8); } else { memcpy(ipi, &target->initiator_ext, 8); memcpy(ipi + 8, &target->sgid.global.interface_id, 8); memcpy(tpi, &target->id_ext, 8); memcpy(tpi + 8, &target->ioc_guid, 8); } /* * Topspin/Cisco SRP targets will reject our login unless we * zero out the first 8 bytes of our initiator port ID and set * the second 8 bytes to the local node GUID. */ if (srp_target_is_topspin(target)) { shost_printk(KERN_DEBUG, target->scsi_host, PFX "Topspin/Cisco initiator port ID workaround " "activated for target GUID %016llx\n", be64_to_cpu(target->ioc_guid)); memset(ipi, 0, 8); memcpy(ipi + 8, &target->srp_host->srp_dev->dev->node_guid, 8); } if (target->using_rdma_cm) status = rdma_connect(ch->rdma_cm.cm_id, &req->rdma_param); else status = ib_send_cm_req(ch->ib_cm.cm_id, &req->ib_param); kfree(req); return status; } static bool srp_queue_remove_work(struct srp_target_port *target) { bool changed = false; spin_lock_irq(&target->lock); if (target->state != SRP_TARGET_REMOVED) { target->state = SRP_TARGET_REMOVED; changed = true; } spin_unlock_irq(&target->lock); if (changed) queue_work(srp_remove_wq, &target->remove_work); return changed; } static void srp_disconnect_target(struct srp_target_port *target) { struct srp_rdma_ch *ch; int i, ret; /* XXX should send SRP_I_LOGOUT request */ for (i = 0; i < target->ch_count; i++) { ch = &target->ch[i]; ch->connected = false; ret = 0; if (target->using_rdma_cm) { if (ch->rdma_cm.cm_id) rdma_disconnect(ch->rdma_cm.cm_id); } else { if (ch->ib_cm.cm_id) ret = ib_send_cm_dreq(ch->ib_cm.cm_id, NULL, 0); } if (ret < 0) { shost_printk(KERN_DEBUG, target->scsi_host, PFX "Sending CM DREQ failed\n"); } } } static int srp_exit_cmd_priv(struct Scsi_Host *shost, struct scsi_cmnd *cmd) { struct srp_target_port *target = host_to_target(shost); struct srp_device *dev = target->srp_host->srp_dev; struct ib_device *ibdev = dev->dev; struct srp_request *req = scsi_cmd_priv(cmd); kfree(req->fr_list); if (req->indirect_dma_addr) { ib_dma_unmap_single(ibdev, req->indirect_dma_addr, target->indirect_size, DMA_TO_DEVICE); } kfree(req->indirect_desc); return 0; } static int srp_init_cmd_priv(struct Scsi_Host *shost, struct scsi_cmnd *cmd) { struct srp_target_port *target = host_to_target(shost); struct srp_device *srp_dev = target->srp_host->srp_dev; struct ib_device *ibdev = srp_dev->dev; struct srp_request *req = scsi_cmd_priv(cmd); dma_addr_t dma_addr; int ret = -ENOMEM; if (srp_dev->use_fast_reg) { req->fr_list = kmalloc_array(target->mr_per_cmd, sizeof(void *), GFP_KERNEL); if (!req->fr_list) goto out; } req->indirect_desc = kmalloc(target->indirect_size, GFP_KERNEL); if (!req->indirect_desc) goto out; dma_addr = ib_dma_map_single(ibdev, req->indirect_desc, target->indirect_size, DMA_TO_DEVICE); if (ib_dma_mapping_error(ibdev, dma_addr)) { srp_exit_cmd_priv(shost, cmd); goto out; } req->indirect_dma_addr = dma_addr; ret = 0; out: return ret; } /** * srp_del_scsi_host_attr() - Remove attributes defined in the host template. * @shost: SCSI host whose attributes to remove from sysfs. * * Note: Any attributes defined in the host template and that did not exist * before invocation of this function will be ignored. */ static void srp_del_scsi_host_attr(struct Scsi_Host *shost) { const struct attribute_group **g; struct attribute **attr; for (g = shost->hostt->shost_groups; *g; ++g) { for (attr = (*g)->attrs; *attr; ++attr) { struct device_attribute *dev_attr = container_of(*attr, typeof(*dev_attr), attr); device_remove_file(&shost->shost_dev, dev_attr); } } } static void srp_remove_target(struct srp_target_port *target) { struct srp_rdma_ch *ch; int i; WARN_ON_ONCE(target->state != SRP_TARGET_REMOVED); srp_del_scsi_host_attr(target->scsi_host); srp_rport_get(target->rport); srp_remove_host(target->scsi_host); scsi_remove_host(target->scsi_host); srp_stop_rport_timers(target->rport); srp_disconnect_target(target); kobj_ns_drop(KOBJ_NS_TYPE_NET, target->net); for (i = 0; i < target->ch_count; i++) { ch = &target->ch[i]; srp_free_ch_ib(target, ch); } cancel_work_sync(&target->tl_err_work); srp_rport_put(target->rport); kfree(target->ch); target->ch = NULL; spin_lock(&target->srp_host->target_lock); list_del(&target->list); spin_unlock(&target->srp_host->target_lock); scsi_host_put(target->scsi_host); } static void srp_remove_work(struct work_struct *work) { struct srp_target_port *target = container_of(work, struct srp_target_port, remove_work); WARN_ON_ONCE(target->state != SRP_TARGET_REMOVED); srp_remove_target(target); } static void srp_rport_delete(struct srp_rport *rport) { struct srp_target_port *target = rport->lld_data; srp_queue_remove_work(target); } /** * srp_connected_ch() - number of connected channels * @target: SRP target port. */ static int srp_connected_ch(struct srp_target_port *target) { int i, c = 0; for (i = 0; i < target->ch_count; i++) c += target->ch[i].connected; return c; } static int srp_connect_ch(struct srp_rdma_ch *ch, uint32_t max_iu_len, bool multich) { struct srp_target_port *target = ch->target; int ret; WARN_ON_ONCE(!multich && srp_connected_ch(target) > 0); ret = srp_lookup_path(ch); if (ret) goto out; while (1) { init_completion(&ch->done); ret = srp_send_req(ch, max_iu_len, multich); if (ret) goto out; ret = wait_for_completion_interruptible(&ch->done); if (ret < 0) goto out; /* * The CM event handling code will set status to * SRP_PORT_REDIRECT if we get a port redirect REJ * back, or SRP_DLID_REDIRECT if we get a lid/qp * redirect REJ back. */ ret = ch->status; switch (ret) { case 0: ch->connected = true; goto out; case SRP_PORT_REDIRECT: ret = srp_lookup_path(ch); if (ret) goto out; break; case SRP_DLID_REDIRECT: break; case SRP_STALE_CONN: shost_printk(KERN_ERR, target->scsi_host, PFX "giving up on stale connection\n"); ret = -ECONNRESET; goto out; default: goto out; } } out: return ret <= 0 ? ret : -ENODEV; } static void srp_inv_rkey_err_done(struct ib_cq *cq, struct ib_wc *wc) { srp_handle_qp_err(cq, wc, "INV RKEY"); } static int srp_inv_rkey(struct srp_request *req, struct srp_rdma_ch *ch, u32 rkey) { struct ib_send_wr wr = { .opcode = IB_WR_LOCAL_INV, .next = NULL, .num_sge = 0, .send_flags = 0, .ex.invalidate_rkey = rkey, }; wr.wr_cqe = &req->reg_cqe; req->reg_cqe.done = srp_inv_rkey_err_done; return ib_post_send(ch->qp, &wr, NULL); } static void srp_unmap_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch, struct srp_request *req) { struct srp_target_port *target = ch->target; struct srp_device *dev = target->srp_host->srp_dev; struct ib_device *ibdev = dev->dev; int i, res; if (!scsi_sglist(scmnd) || (scmnd->sc_data_direction != DMA_TO_DEVICE && scmnd->sc_data_direction != DMA_FROM_DEVICE)) return; if (dev->use_fast_reg) { struct srp_fr_desc **pfr; for (i = req->nmdesc, pfr = req->fr_list; i > 0; i--, pfr++) { res = srp_inv_rkey(req, ch, (*pfr)->mr->rkey); if (res < 0) { shost_printk(KERN_ERR, target->scsi_host, PFX "Queueing INV WR for rkey %#x failed (%d)\n", (*pfr)->mr->rkey, res); queue_work(system_long_wq, &target->tl_err_work); } } if (req->nmdesc) srp_fr_pool_put(ch->fr_pool, req->fr_list, req->nmdesc); } ib_dma_unmap_sg(ibdev, scsi_sglist(scmnd), scsi_sg_count(scmnd), scmnd->sc_data_direction); } /** * srp_claim_req - Take ownership of the scmnd associated with a request. * @ch: SRP RDMA channel. * @req: SRP request. * @sdev: If not NULL, only take ownership for this SCSI device. * @scmnd: If NULL, take ownership of @req->scmnd. If not NULL, only take * ownership of @req->scmnd if it equals @scmnd. * * Return value: * Either NULL or a pointer to the SCSI command the caller became owner of. */ static struct scsi_cmnd *srp_claim_req(struct srp_rdma_ch *ch, struct srp_request *req, struct scsi_device *sdev, struct scsi_cmnd *scmnd) { unsigned long flags; spin_lock_irqsave(&ch->lock, flags); if (req->scmnd && (!sdev || req->scmnd->device == sdev) && (!scmnd || req->scmnd == scmnd)) { scmnd = req->scmnd; req->scmnd = NULL; } else { scmnd = NULL; } spin_unlock_irqrestore(&ch->lock, flags); return scmnd; } /** * srp_free_req() - Unmap data and adjust ch->req_lim. * @ch: SRP RDMA channel. * @req: Request to be freed. * @scmnd: SCSI command associated with @req. * @req_lim_delta: Amount to be added to @target->req_lim. */ static void srp_free_req(struct srp_rdma_ch *ch, struct srp_request *req, struct scsi_cmnd *scmnd, s32 req_lim_delta) { unsigned long flags; srp_unmap_data(scmnd, ch, req); spin_lock_irqsave(&ch->lock, flags); ch->req_lim += req_lim_delta; spin_unlock_irqrestore(&ch->lock, flags); } static void srp_finish_req(struct srp_rdma_ch *ch, struct srp_request *req, struct scsi_device *sdev, int result) { struct scsi_cmnd *scmnd = srp_claim_req(ch, req, sdev, NULL); if (scmnd) { srp_free_req(ch, req, scmnd, 0); scmnd->result = result; scsi_done(scmnd); } } struct srp_terminate_context { struct srp_target_port *srp_target; int scsi_result; }; static bool srp_terminate_cmd(struct scsi_cmnd *scmnd, void *context_ptr) { struct srp_terminate_context *context = context_ptr; struct srp_target_port *target = context->srp_target; u32 tag = blk_mq_unique_tag(scsi_cmd_to_rq(scmnd)); struct srp_rdma_ch *ch = &target->ch[blk_mq_unique_tag_to_hwq(tag)]; struct srp_request *req = scsi_cmd_priv(scmnd); srp_finish_req(ch, req, NULL, context->scsi_result); return true; } static void srp_terminate_io(struct srp_rport *rport) { struct srp_target_port *target = rport->lld_data; struct srp_terminate_context context = { .srp_target = target, .scsi_result = DID_TRANSPORT_FAILFAST << 16 }; scsi_host_busy_iter(target->scsi_host, srp_terminate_cmd, &context); } /* Calculate maximum initiator to target information unit length. */ static uint32_t srp_max_it_iu_len(int cmd_sg_cnt, bool use_imm_data, uint32_t max_it_iu_size) { uint32_t max_iu_len = sizeof(struct srp_cmd) + SRP_MAX_ADD_CDB_LEN + sizeof(struct srp_indirect_buf) + cmd_sg_cnt * sizeof(struct srp_direct_buf); if (use_imm_data) max_iu_len = max(max_iu_len, SRP_IMM_DATA_OFFSET + srp_max_imm_data); if (max_it_iu_size) max_iu_len = min(max_iu_len, max_it_iu_size); pr_debug("max_iu_len = %d\n", max_iu_len); return max_iu_len; } /* * It is up to the caller to ensure that srp_rport_reconnect() calls are * serialized and that no concurrent srp_queuecommand(), srp_abort(), * srp_reset_device() or srp_reset_host() calls will occur while this function * is in progress. One way to realize that is not to call this function * directly but to call srp_reconnect_rport() instead since that last function * serializes calls of this function via rport->mutex and also blocks * srp_queuecommand() calls before invoking this function. */ static int srp_rport_reconnect(struct srp_rport *rport) { struct srp_target_port *target = rport->lld_data; struct srp_rdma_ch *ch; uint32_t max_iu_len = srp_max_it_iu_len(target->cmd_sg_cnt, srp_use_imm_data, target->max_it_iu_size); int i, j, ret = 0; bool multich = false; srp_disconnect_target(target); if (target->state == SRP_TARGET_SCANNING) return -ENODEV; /* * Now get a new local CM ID so that we avoid confusing the target in * case things are really fouled up. Doing so also ensures that all CM * callbacks will have finished before a new QP is allocated. */ for (i = 0; i < target->ch_count; i++) { ch = &target->ch[i]; ret += srp_new_cm_id(ch); } { struct srp_terminate_context context = { .srp_target = target, .scsi_result = DID_RESET << 16}; scsi_host_busy_iter(target->scsi_host, srp_terminate_cmd, &context); } for (i = 0; i < target->ch_count; i++) { ch = &target->ch[i]; /* * Whether or not creating a new CM ID succeeded, create a new * QP. This guarantees that all completion callback function * invocations have finished before request resetting starts. */ ret += srp_create_ch_ib(ch); INIT_LIST_HEAD(&ch->free_tx); for (j = 0; j < target->queue_size; ++j) list_add(&ch->tx_ring[j]->list, &ch->free_tx); } target->qp_in_error = false; for (i = 0; i < target->ch_count; i++) { ch = &target->ch[i]; if (ret) break; ret = srp_connect_ch(ch, max_iu_len, multich); multich = true; } if (ret == 0) shost_printk(KERN_INFO, target->scsi_host, PFX "reconnect succeeded\n"); return ret; } static void srp_map_desc(struct srp_map_state *state, dma_addr_t dma_addr, unsigned int dma_len, u32 rkey) { struct srp_direct_buf *desc = state->desc; WARN_ON_ONCE(!dma_len); desc->va = cpu_to_be64(dma_addr); desc->key = cpu_to_be32(rkey); desc->len = cpu_to_be32(dma_len); state->total_len += dma_len; state->desc++; state->ndesc++; } static void srp_reg_mr_err_done(struct ib_cq *cq, struct ib_wc *wc) { srp_handle_qp_err(cq, wc, "FAST REG"); } /* * Map up to sg_nents elements of state->sg where *sg_offset_p is the offset * where to start in the first element. If sg_offset_p != NULL then * *sg_offset_p is updated to the offset in state->sg[retval] of the first * byte that has not yet been mapped. */ static int srp_map_finish_fr(struct srp_map_state *state, struct srp_request *req, struct srp_rdma_ch *ch, int sg_nents, unsigned int *sg_offset_p) { struct srp_target_port *target = ch->target; struct srp_device *dev = target->srp_host->srp_dev; struct ib_reg_wr wr; struct srp_fr_desc *desc; u32 rkey; int n, err; if (state->fr.next >= state->fr.end) { shost_printk(KERN_ERR, ch->target->scsi_host, PFX "Out of MRs (mr_per_cmd = %d)\n", ch->target->mr_per_cmd); return -ENOMEM; } WARN_ON_ONCE(!dev->use_fast_reg); if (sg_nents == 1 && target->global_rkey) { unsigned int sg_offset = sg_offset_p ? *sg_offset_p : 0; srp_map_desc(state, sg_dma_address(state->sg) + sg_offset, sg_dma_len(state->sg) - sg_offset, target->global_rkey); if (sg_offset_p) *sg_offset_p = 0; return 1; } desc = srp_fr_pool_get(ch->fr_pool); if (!desc) return -ENOMEM; rkey = ib_inc_rkey(desc->mr->rkey); ib_update_fast_reg_key(desc->mr, rkey); n = ib_map_mr_sg(desc->mr, state->sg, sg_nents, sg_offset_p, dev->mr_page_size); if (unlikely(n < 0)) { srp_fr_pool_put(ch->fr_pool, &desc, 1); pr_debug("%s: ib_map_mr_sg(%d, %d) returned %d.\n", dev_name(&req->scmnd->device->sdev_gendev), sg_nents, sg_offset_p ? *sg_offset_p : -1, n); return n; } WARN_ON_ONCE(desc->mr->length == 0); req->reg_cqe.done = srp_reg_mr_err_done; wr.wr.next = NULL; wr.wr.opcode = IB_WR_REG_MR; wr.wr.wr_cqe = &req->reg_cqe; wr.wr.num_sge = 0; wr.wr.send_flags = 0; wr.mr = desc->mr; wr.key = desc->mr->rkey; wr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE); *state->fr.next++ = desc; state->nmdesc++; srp_map_desc(state, desc->mr->iova, desc->mr->length, desc->mr->rkey); err = ib_post_send(ch->qp, &wr.wr, NULL); if (unlikely(err)) { WARN_ON_ONCE(err == -ENOMEM); return err; } return n; } static int srp_map_sg_fr(struct srp_map_state *state, struct srp_rdma_ch *ch, struct srp_request *req, struct scatterlist *scat, int count) { unsigned int sg_offset = 0; state->fr.next = req->fr_list; state->fr.end = req->fr_list + ch->target->mr_per_cmd; state->sg = scat; if (count == 0) return 0; while (count) { int i, n; n = srp_map_finish_fr(state, req, ch, count, &sg_offset); if (unlikely(n < 0)) return n; count -= n; for (i = 0; i < n; i++) state->sg = sg_next(state->sg); } return 0; } static int srp_map_sg_dma(struct srp_map_state *state, struct srp_rdma_ch *ch, struct srp_request *req, struct scatterlist *scat, int count) { struct srp_target_port *target = ch->target; struct scatterlist *sg; int i; for_each_sg(scat, sg, count, i) { srp_map_desc(state, sg_dma_address(sg), sg_dma_len(sg), target->global_rkey); } return 0; } /* * Register the indirect data buffer descriptor with the HCA. * * Note: since the indirect data buffer descriptor has been allocated with * kmalloc() it is guaranteed that this buffer is a physically contiguous * memory buffer. */ static int srp_map_idb(struct srp_rdma_ch *ch, struct srp_request *req, void **next_mr, void **end_mr, u32 idb_len, __be32 *idb_rkey) { struct srp_target_port *target = ch->target; struct srp_device *dev = target->srp_host->srp_dev; struct srp_map_state state; struct srp_direct_buf idb_desc; struct scatterlist idb_sg[1]; int ret; memset(&state, 0, sizeof(state)); memset(&idb_desc, 0, sizeof(idb_desc)); state.gen.next = next_mr; state.gen.end = end_mr; state.desc = &idb_desc; state.base_dma_addr = req->indirect_dma_addr; state.dma_len = idb_len; if (dev->use_fast_reg) { state.sg = idb_sg; sg_init_one(idb_sg, req->indirect_desc, idb_len); idb_sg->dma_address = req->indirect_dma_addr; /* hack! */ #ifdef CONFIG_NEED_SG_DMA_LENGTH idb_sg->dma_length = idb_sg->length; /* hack^2 */ #endif ret = srp_map_finish_fr(&state, req, ch, 1, NULL); if (ret < 0) return ret; WARN_ON_ONCE(ret < 1); } else { return -EINVAL; } *idb_rkey = idb_desc.key; return 0; } static void srp_check_mapping(struct srp_map_state *state, struct srp_rdma_ch *ch, struct srp_request *req, struct scatterlist *scat, int count) { struct srp_device *dev = ch->target->srp_host->srp_dev; struct srp_fr_desc **pfr; u64 desc_len = 0, mr_len = 0; int i; for (i = 0; i < state->ndesc; i++) desc_len += be32_to_cpu(req->indirect_desc[i].len); if (dev->use_fast_reg) for (i = 0, pfr = req->fr_list; i < state->nmdesc; i++, pfr++) mr_len += (*pfr)->mr->length; if (desc_len != scsi_bufflen(req->scmnd) || mr_len > scsi_bufflen(req->scmnd)) pr_err("Inconsistent: scsi len %d <> desc len %lld <> mr len %lld; ndesc %d; nmdesc = %d\n", scsi_bufflen(req->scmnd), desc_len, mr_len, state->ndesc, state->nmdesc); } /** * srp_map_data() - map SCSI data buffer onto an SRP request * @scmnd: SCSI command to map * @ch: SRP RDMA channel * @req: SRP request * * Returns the length in bytes of the SRP_CMD IU or a negative value if * mapping failed. The size of any immediate data is not included in the * return value. */ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch, struct srp_request *req) { struct srp_target_port *target = ch->target; struct scatterlist *scat, *sg; struct srp_cmd *cmd = req->cmd->buf; int i, len, nents, count, ret; struct srp_device *dev; struct ib_device *ibdev; struct srp_map_state state; struct srp_indirect_buf *indirect_hdr; u64 data_len; u32 idb_len, table_len; __be32 idb_rkey; u8 fmt; req->cmd->num_sge = 1; if (!scsi_sglist(scmnd) || scmnd->sc_data_direction == DMA_NONE) return sizeof(struct srp_cmd) + cmd->add_cdb_len; if (scmnd->sc_data_direction != DMA_FROM_DEVICE && scmnd->sc_data_direction != DMA_TO_DEVICE) { shost_printk(KERN_WARNING, target->scsi_host, PFX "Unhandled data direction %d\n", scmnd->sc_data_direction); return -EINVAL; } nents = scsi_sg_count(scmnd); scat = scsi_sglist(scmnd); data_len = scsi_bufflen(scmnd); dev = target->srp_host->srp_dev; ibdev = dev->dev; count = ib_dma_map_sg(ibdev, scat, nents, scmnd->sc_data_direction); if (unlikely(count == 0)) return -EIO; if (ch->use_imm_data && count <= ch->max_imm_sge && SRP_IMM_DATA_OFFSET + data_len <= ch->max_it_iu_len && scmnd->sc_data_direction == DMA_TO_DEVICE) { struct srp_imm_buf *buf; struct ib_sge *sge = &req->cmd->sge[1]; fmt = SRP_DATA_DESC_IMM; len = SRP_IMM_DATA_OFFSET; req->nmdesc = 0; buf = (void *)cmd->add_data + cmd->add_cdb_len; buf->len = cpu_to_be32(data_len); WARN_ON_ONCE((void *)(buf + 1) > (void *)cmd + len); for_each_sg(scat, sg, count, i) { sge[i].addr = sg_dma_address(sg); sge[i].length = sg_dma_len(sg); sge[i].lkey = target->lkey; } req->cmd->num_sge += count; goto map_complete; } fmt = SRP_DATA_DESC_DIRECT; len = sizeof(struct srp_cmd) + cmd->add_cdb_len + sizeof(struct srp_direct_buf); if (count == 1 && target->global_rkey) { /* * The midlayer only generated a single gather/scatter * entry, or DMA mapping coalesced everything to a * single entry. So a direct descriptor along with * the DMA MR suffices. */ struct srp_direct_buf *buf; buf = (void *)cmd->add_data + cmd->add_cdb_len; buf->va = cpu_to_be64(sg_dma_address(scat)); buf->key = cpu_to_be32(target->global_rkey); buf->len = cpu_to_be32(sg_dma_len(scat)); req->nmdesc = 0; goto map_complete; } /* * We have more than one scatter/gather entry, so build our indirect * descriptor table, trying to merge as many entries as we can. */ indirect_hdr = (void *)cmd->add_data + cmd->add_cdb_len; ib_dma_sync_single_for_cpu(ibdev, req->indirect_dma_addr, target->indirect_size, DMA_TO_DEVICE); memset(&state, 0, sizeof(state)); state.desc = req->indirect_desc; if (dev->use_fast_reg) ret = srp_map_sg_fr(&state, ch, req, scat, count); else ret = srp_map_sg_dma(&state, ch, req, scat, count); req->nmdesc = state.nmdesc; if (ret < 0) goto unmap; { DEFINE_DYNAMIC_DEBUG_METADATA(ddm, "Memory mapping consistency check"); if (DYNAMIC_DEBUG_BRANCH(ddm)) srp_check_mapping(&state, ch, req, scat, count); } /* We've mapped the request, now pull as much of the indirect * descriptor table as we can into the command buffer. If this * target is not using an external indirect table, we are * guaranteed to fit into the command, as the SCSI layer won't * give us more S/G entries than we allow. */ if (state.ndesc == 1) { /* * Memory registration collapsed the sg-list into one entry, * so use a direct descriptor. */ struct srp_direct_buf *buf; buf = (void *)cmd->add_data + cmd->add_cdb_len; *buf = req->indirect_desc[0]; goto map_complete; } if (unlikely(target->cmd_sg_cnt < state.ndesc && !target->allow_ext_sg)) { shost_printk(KERN_ERR, target->scsi_host, "Could not fit S/G list into SRP_CMD\n"); ret = -EIO; goto unmap; } count = min(state.ndesc, target->cmd_sg_cnt); table_len = state.ndesc * sizeof (struct srp_direct_buf); idb_len = sizeof(struct srp_indirect_buf) + table_len; fmt = SRP_DATA_DESC_INDIRECT; len = sizeof(struct srp_cmd) + cmd->add_cdb_len + sizeof(struct srp_indirect_buf); len += count * sizeof (struct srp_direct_buf); memcpy(indirect_hdr->desc_list, req->indirect_desc, count * sizeof (struct srp_direct_buf)); if (!target->global_rkey) { ret = srp_map_idb(ch, req, state.gen.next, state.gen.end, idb_len, &idb_rkey); if (ret < 0) goto unmap; req->nmdesc++; } else { idb_rkey = cpu_to_be32(target->global_rkey); } indirect_hdr->table_desc.va = cpu_to_be64(req->indirect_dma_addr); indirect_hdr->table_desc.key = idb_rkey; indirect_hdr->table_desc.len = cpu_to_be32(table_len); indirect_hdr->len = cpu_to_be32(state.total_len); if (scmnd->sc_data_direction == DMA_TO_DEVICE) cmd->data_out_desc_cnt = count; else cmd->data_in_desc_cnt = count; ib_dma_sync_single_for_device(ibdev, req->indirect_dma_addr, table_len, DMA_TO_DEVICE); map_complete: if (scmnd->sc_data_direction == DMA_TO_DEVICE) cmd->buf_fmt = fmt << 4; else cmd->buf_fmt = fmt; return len; unmap: srp_unmap_data(scmnd, ch, req); if (ret == -ENOMEM && req->nmdesc >= target->mr_pool_size) ret = -E2BIG; return ret; } /* * Return an IU and possible credit to the free pool */ static void srp_put_tx_iu(struct srp_rdma_ch *ch, struct srp_iu *iu, enum srp_iu_type iu_type) { unsigned long flags; spin_lock_irqsave(&ch->lock, flags); list_add(&iu->list, &ch->free_tx); if (iu_type != SRP_IU_RSP) ++ch->req_lim; spin_unlock_irqrestore(&ch->lock, flags); } /* * Must be called with ch->lock held to protect req_lim and free_tx. * If IU is not sent, it must be returned using srp_put_tx_iu(). * * Note: * An upper limit for the number of allocated information units for each * request type is: * - SRP_IU_CMD: SRP_CMD_SQ_SIZE, since the SCSI mid-layer never queues * more than Scsi_Host.can_queue requests. * - SRP_IU_TSK_MGMT: SRP_TSK_MGMT_SQ_SIZE. * - SRP_IU_RSP: 1, since a conforming SRP target never sends more than * one unanswered SRP request to an initiator. */ static struct srp_iu *__srp_get_tx_iu(struct srp_rdma_ch *ch, enum srp_iu_type iu_type) { struct srp_target_port *target = ch->target; s32 rsv = (iu_type == SRP_IU_TSK_MGMT) ? 0 : SRP_TSK_MGMT_SQ_SIZE; struct srp_iu *iu; lockdep_assert_held(&ch->lock); ib_process_cq_direct(ch->send_cq, -1); if (list_empty(&ch->free_tx)) return NULL; /* Initiator responses to target requests do not consume credits */ if (iu_type != SRP_IU_RSP) { if (ch->req_lim <= rsv) { ++target->zero_req_lim; return NULL; } --ch->req_lim; } iu = list_first_entry(&ch->free_tx, struct srp_iu, list); list_del(&iu->list); return iu; } /* * Note: if this function is called from inside ib_drain_sq() then it will * be called without ch->lock being held. If ib_drain_sq() dequeues a WQE * with status IB_WC_SUCCESS then that's a bug. */ static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc) { struct srp_iu *iu = container_of(wc->wr_cqe, struct srp_iu, cqe); struct srp_rdma_ch *ch = cq->cq_context; if (unlikely(wc->status != IB_WC_SUCCESS)) { srp_handle_qp_err(cq, wc, "SEND"); return; } lockdep_assert_held(&ch->lock); list_add(&iu->list, &ch->free_tx); } /** * srp_post_send() - send an SRP information unit * @ch: RDMA channel over which to send the information unit. * @iu: Information unit to send. * @len: Length of the information unit excluding immediate data. */ static int srp_post_send(struct srp_rdma_ch *ch, struct srp_iu *iu, int len) { struct srp_target_port *target = ch->target; struct ib_send_wr wr; if (WARN_ON_ONCE(iu->num_sge > SRP_MAX_SGE)) return -EINVAL; iu->sge[0].addr = iu->dma; iu->sge[0].length = len; iu->sge[0].lkey = target->lkey; iu->cqe.done = srp_send_done; wr.next = NULL; wr.wr_cqe = &iu->cqe; wr.sg_list = &iu->sge[0]; wr.num_sge = iu->num_sge; wr.opcode = IB_WR_SEND; wr.send_flags = IB_SEND_SIGNALED; return ib_post_send(ch->qp, &wr, NULL); } static int srp_post_recv(struct srp_rdma_ch *ch, struct srp_iu *iu) { struct srp_target_port *target = ch->target; struct ib_recv_wr wr; struct ib_sge list; list.addr = iu->dma; list.length = iu->size; list.lkey = target->lkey; iu->cqe.done = srp_recv_done; wr.next = NULL; wr.wr_cqe = &iu->cqe; wr.sg_list = &list; wr.num_sge = 1; return ib_post_recv(ch->qp, &wr, NULL); } static void srp_process_rsp(struct srp_rdma_ch *ch, struct srp_rsp *rsp) { struct srp_target_port *target = ch->target; struct srp_request *req; struct scsi_cmnd *scmnd; unsigned long flags; if (unlikely(rsp->tag & SRP_TAG_TSK_MGMT)) { spin_lock_irqsave(&ch->lock, flags); ch->req_lim += be32_to_cpu(rsp->req_lim_delta); if (rsp->tag == ch->tsk_mgmt_tag) { ch->tsk_mgmt_status = -1; if (be32_to_cpu(rsp->resp_data_len) >= 4) ch->tsk_mgmt_status = rsp->data[3]; complete(&ch->tsk_mgmt_done); } else { shost_printk(KERN_ERR, target->scsi_host, "Received tsk mgmt response too late for tag %#llx\n", rsp->tag); } spin_unlock_irqrestore(&ch->lock, flags); } else { scmnd = scsi_host_find_tag(target->scsi_host, rsp->tag); if (scmnd) { req = scsi_cmd_priv(scmnd); scmnd = srp_claim_req(ch, req, NULL, scmnd); } if (!scmnd) { shost_printk(KERN_ERR, target->scsi_host, "Null scmnd for RSP w/tag %#016llx received on ch %td / QP %#x\n", rsp->tag, ch - target->ch, ch->qp->qp_num); spin_lock_irqsave(&ch->lock, flags); ch->req_lim += be32_to_cpu(rsp->req_lim_delta); spin_unlock_irqrestore(&ch->lock, flags); return; } scmnd->result = rsp->status; if (rsp->flags & SRP_RSP_FLAG_SNSVALID) { memcpy(scmnd->sense_buffer, rsp->data + be32_to_cpu(rsp->resp_data_len), min_t(int, be32_to_cpu(rsp->sense_data_len), SCSI_SENSE_BUFFERSIZE)); } if (unlikely(rsp->flags & SRP_RSP_FLAG_DIUNDER)) scsi_set_resid(scmnd, be32_to_cpu(rsp->data_in_res_cnt)); else if (unlikely(rsp->flags & SRP_RSP_FLAG_DOUNDER)) scsi_set_resid(scmnd, be32_to_cpu(rsp->data_out_res_cnt)); srp_free_req(ch, req, scmnd, be32_to_cpu(rsp->req_lim_delta)); scsi_done(scmnd); } } static int srp_response_common(struct srp_rdma_ch *ch, s32 req_delta, void *rsp, int len) { struct srp_target_port *target = ch->target; struct ib_device *dev = target->srp_host->srp_dev->dev; unsigned long flags; struct srp_iu *iu; int err; spin_lock_irqsave(&ch->lock, flags); ch->req_lim += req_delta; iu = __srp_get_tx_iu(ch, SRP_IU_RSP); spin_unlock_irqrestore(&ch->lock, flags); if (!iu) { shost_printk(KERN_ERR, target->scsi_host, PFX "no IU available to send response\n"); return 1; } iu->num_sge = 1; ib_dma_sync_single_for_cpu(dev, iu->dma, len, DMA_TO_DEVICE); memcpy(iu->buf, rsp, len); ib_dma_sync_single_for_device(dev, iu->dma, len, DMA_TO_DEVICE); err = srp_post_send(ch, iu, len); if (err) { shost_printk(KERN_ERR, target->scsi_host, PFX "unable to post response: %d\n", err); srp_put_tx_iu(ch, iu, SRP_IU_RSP); } return err; } static void srp_process_cred_req(struct srp_rdma_ch *ch, struct srp_cred_req *req) { struct srp_cred_rsp rsp = { .opcode = SRP_CRED_RSP, .tag = req->tag, }; s32 delta = be32_to_cpu(req->req_lim_delta); if (srp_response_common(ch, delta, &rsp, sizeof(rsp))) shost_printk(KERN_ERR, ch->target->scsi_host, PFX "problems processing SRP_CRED_REQ\n"); } static void srp_process_aer_req(struct srp_rdma_ch *ch, struct srp_aer_req *req) { struct srp_target_port *target = ch->target; struct srp_aer_rsp rsp = { .opcode = SRP_AER_RSP, .tag = req->tag, }; s32 delta = be32_to_cpu(req->req_lim_delta); shost_printk(KERN_ERR, target->scsi_host, PFX "ignoring AER for LUN %llu\n", scsilun_to_int(&req->lun)); if (srp_response_common(ch, delta, &rsp, sizeof(rsp))) shost_printk(KERN_ERR, target->scsi_host, PFX "problems processing SRP_AER_REQ\n"); } static void srp_recv_done(struct ib_cq *cq, struct ib_wc *wc) { struct srp_iu *iu = container_of(wc->wr_cqe, struct srp_iu, cqe); struct srp_rdma_ch *ch = cq->cq_context; struct srp_target_port *target = ch->target; struct ib_device *dev = target->srp_host->srp_dev->dev; int res; u8 opcode; if (unlikely(wc->status != IB_WC_SUCCESS)) { srp_handle_qp_err(cq, wc, "RECV"); return; } ib_dma_sync_single_for_cpu(dev, iu->dma, ch->max_ti_iu_len, DMA_FROM_DEVICE); opcode = *(u8 *) iu->buf; if (0) { shost_printk(KERN_ERR, target->scsi_host, PFX "recv completion, opcode 0x%02x\n", opcode); print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 8, 1, iu->buf, wc->byte_len, true); } switch (opcode) { case SRP_RSP: srp_process_rsp(ch, iu->buf); break; case SRP_CRED_REQ: srp_process_cred_req(ch, iu->buf); break; case SRP_AER_REQ: srp_process_aer_req(ch, iu->buf); break; case SRP_T_LOGOUT: /* XXX Handle target logout */ shost_printk(KERN_WARNING, target->scsi_host, PFX "Got target logout request\n"); break; default: shost_printk(KERN_WARNING, target->scsi_host, PFX "Unhandled SRP opcode 0x%02x\n", opcode); break; } ib_dma_sync_single_for_device(dev, iu->dma, ch->max_ti_iu_len, DMA_FROM_DEVICE); res = srp_post_recv(ch, iu); if (res != 0) shost_printk(KERN_ERR, target->scsi_host, PFX "Recv failed with error code %d\n", res); } /** * srp_tl_err_work() - handle a transport layer error * @work: Work structure embedded in an SRP target port. * * Note: This function may get invoked before the rport has been created, * hence the target->rport test. */ static void srp_tl_err_work(struct work_struct *work) { struct srp_target_port *target; target = container_of(work, struct srp_target_port, tl_err_work); if (target->rport) srp_start_tl_fail_timers(target->rport); } static void srp_handle_qp_err(struct ib_cq *cq, struct ib_wc *wc, const char *opname) { struct srp_rdma_ch *ch = cq->cq_context; struct srp_target_port *target = ch->target; if (ch->connected && !target->qp_in_error) { shost_printk(KERN_ERR, target->scsi_host, PFX "failed %s status %s (%d) for CQE %p\n", opname, ib_wc_status_msg(wc->status), wc->status, wc->wr_cqe); queue_work(system_long_wq, &target->tl_err_work); } target->qp_in_error = true; } static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd) { struct request *rq = scsi_cmd_to_rq(scmnd); struct srp_target_port *target = host_to_target(shost); struct srp_rdma_ch *ch; struct srp_request *req = scsi_cmd_priv(scmnd); struct srp_iu *iu; struct srp_cmd *cmd; struct ib_device *dev; unsigned long flags; u32 tag; int len, ret; scmnd->result = srp_chkready(target->rport); if (unlikely(scmnd->result)) goto err; WARN_ON_ONCE(rq->tag < 0); tag = blk_mq_unique_tag(rq); ch = &target->ch[blk_mq_unique_tag_to_hwq(tag)]; spin_lock_irqsave(&ch->lock, flags); iu = __srp_get_tx_iu(ch, SRP_IU_CMD); spin_unlock_irqrestore(&ch->lock, flags); if (!iu) goto err; dev = target->srp_host->srp_dev->dev; ib_dma_sync_single_for_cpu(dev, iu->dma, ch->max_it_iu_len, DMA_TO_DEVICE); cmd = iu->buf; memset(cmd, 0, sizeof *cmd); cmd->opcode = SRP_CMD; int_to_scsilun(scmnd->device->lun, &cmd->lun); cmd->tag = tag; memcpy(cmd->cdb, scmnd->cmnd, scmnd->cmd_len); if (unlikely(scmnd->cmd_len > sizeof(cmd->cdb))) { cmd->add_cdb_len = round_up(scmnd->cmd_len - sizeof(cmd->cdb), 4); if (WARN_ON_ONCE(cmd->add_cdb_len > SRP_MAX_ADD_CDB_LEN)) goto err_iu; } req->scmnd = scmnd; req->cmd = iu; len = srp_map_data(scmnd, ch, req); if (len < 0) { shost_printk(KERN_ERR, target->scsi_host, PFX "Failed to map data (%d)\n", len); /* * If we ran out of memory descriptors (-ENOMEM) because an * application is queuing many requests with more than * max_pages_per_mr sg-list elements, tell the SCSI mid-layer * to reduce queue depth temporarily. */ scmnd->result = len == -ENOMEM ? DID_OK << 16 | SAM_STAT_TASK_SET_FULL : DID_ERROR << 16; goto err_iu; } ib_dma_sync_single_for_device(dev, iu->dma, ch->max_it_iu_len, DMA_TO_DEVICE); if (srp_post_send(ch, iu, len)) { shost_printk(KERN_ERR, target->scsi_host, PFX "Send failed\n"); scmnd->result = DID_ERROR << 16; goto err_unmap; } return 0; err_unmap: srp_unmap_data(scmnd, ch, req); err_iu: srp_put_tx_iu(ch, iu, SRP_IU_CMD); /* * Avoid that the loops that iterate over the request ring can * encounter a dangling SCSI command pointer. */ req->scmnd = NULL; err: if (scmnd->result) { scsi_done(scmnd); ret = 0; } else { ret = SCSI_MLQUEUE_HOST_BUSY; } return ret; } /* * Note: the resources allocated in this function are freed in * srp_free_ch_ib(). */ static int srp_alloc_iu_bufs(struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; int i; ch->rx_ring = kcalloc(target->queue_size, sizeof(*ch->rx_ring), GFP_KERNEL); if (!ch->rx_ring) goto err_no_ring; ch->tx_ring = kcalloc(target->queue_size, sizeof(*ch->tx_ring), GFP_KERNEL); if (!ch->tx_ring) goto err_no_ring; for (i = 0; i < target->queue_size; ++i) { ch->rx_ring[i] = srp_alloc_iu(target->srp_host, ch->max_ti_iu_len, GFP_KERNEL, DMA_FROM_DEVICE); if (!ch->rx_ring[i]) goto err; } for (i = 0; i < target->queue_size; ++i) { ch->tx_ring[i] = srp_alloc_iu(target->srp_host, ch->max_it_iu_len, GFP_KERNEL, DMA_TO_DEVICE); if (!ch->tx_ring[i]) goto err; list_add(&ch->tx_ring[i]->list, &ch->free_tx); } return 0; err: for (i = 0; i < target->queue_size; ++i) { srp_free_iu(target->srp_host, ch->rx_ring[i]); srp_free_iu(target->srp_host, ch->tx_ring[i]); } err_no_ring: kfree(ch->tx_ring); ch->tx_ring = NULL; kfree(ch->rx_ring); ch->rx_ring = NULL; return -ENOMEM; } static uint32_t srp_compute_rq_tmo(struct ib_qp_attr *qp_attr, int attr_mask) { uint64_t T_tr_ns, max_compl_time_ms; uint32_t rq_tmo_jiffies; /* * According to section 11.2.4.2 in the IBTA spec (Modify Queue Pair, * table 91), both the QP timeout and the retry count have to be set * for RC QP's during the RTR to RTS transition. */ WARN_ON_ONCE((attr_mask & (IB_QP_TIMEOUT | IB_QP_RETRY_CNT)) != (IB_QP_TIMEOUT | IB_QP_RETRY_CNT)); /* * Set target->rq_tmo_jiffies to one second more than the largest time * it can take before an error completion is generated. See also * C9-140..142 in the IBTA spec for more information about how to * convert the QP Local ACK Timeout value to nanoseconds. */ T_tr_ns = 4096 * (1ULL << qp_attr->timeout); max_compl_time_ms = qp_attr->retry_cnt * 4 * T_tr_ns; do_div(max_compl_time_ms, NSEC_PER_MSEC); rq_tmo_jiffies = msecs_to_jiffies(max_compl_time_ms + 1000); return rq_tmo_jiffies; } static void srp_cm_rep_handler(struct ib_cm_id *cm_id, const struct srp_login_rsp *lrsp, struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; struct ib_qp_attr *qp_attr = NULL; int attr_mask = 0; int ret = 0; int i; if (lrsp->opcode == SRP_LOGIN_RSP) { ch->max_ti_iu_len = be32_to_cpu(lrsp->max_ti_iu_len); ch->req_lim = be32_to_cpu(lrsp->req_lim_delta); ch->use_imm_data = srp_use_imm_data && (lrsp->rsp_flags & SRP_LOGIN_RSP_IMMED_SUPP); ch->max_it_iu_len = srp_max_it_iu_len(target->cmd_sg_cnt, ch->use_imm_data, target->max_it_iu_size); WARN_ON_ONCE(ch->max_it_iu_len > be32_to_cpu(lrsp->max_it_iu_len)); if (ch->use_imm_data) shost_printk(KERN_DEBUG, target->scsi_host, PFX "using immediate data\n"); /* * Reserve credits for task management so we don't * bounce requests back to the SCSI mid-layer. */ target->scsi_host->can_queue = min(ch->req_lim - SRP_TSK_MGMT_SQ_SIZE, target->scsi_host->can_queue); target->scsi_host->cmd_per_lun = min_t(int, target->scsi_host->can_queue, target->scsi_host->cmd_per_lun); } else { shost_printk(KERN_WARNING, target->scsi_host, PFX "Unhandled RSP opcode %#x\n", lrsp->opcode); ret = -ECONNRESET; goto error; } if (!ch->rx_ring) { ret = srp_alloc_iu_bufs(ch); if (ret) goto error; } for (i = 0; i < target->queue_size; i++) { struct srp_iu *iu = ch->rx_ring[i]; ret = srp_post_recv(ch, iu); if (ret) goto error; } if (!target->using_rdma_cm) { ret = -ENOMEM; qp_attr = kmalloc(sizeof(*qp_attr), GFP_KERNEL); if (!qp_attr) goto error; qp_attr->qp_state = IB_QPS_RTR; ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); if (ret) goto error_free; ret = ib_modify_qp(ch->qp, qp_attr, attr_mask); if (ret) goto error_free; qp_attr->qp_state = IB_QPS_RTS; ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); if (ret) goto error_free; target->rq_tmo_jiffies = srp_compute_rq_tmo(qp_attr, attr_mask); ret = ib_modify_qp(ch->qp, qp_attr, attr_mask); if (ret) goto error_free; ret = ib_send_cm_rtu(cm_id, NULL, 0); } error_free: kfree(qp_attr); error: ch->status = ret; } static void srp_ib_cm_rej_handler(struct ib_cm_id *cm_id, const struct ib_cm_event *event, struct srp_rdma_ch *ch) { struct srp_target_port *target = ch->target; struct Scsi_Host *shost = target->scsi_host; struct ib_class_port_info *cpi; int opcode; u16 dlid; switch (event->param.rej_rcvd.reason) { case IB_CM_REJ_PORT_CM_REDIRECT: cpi = event->param.rej_rcvd.ari; dlid = be16_to_cpu(cpi->redirect_lid); sa_path_set_dlid(&ch->ib_cm.path, dlid); ch->ib_cm.path.pkey = cpi->redirect_pkey; cm_id->remote_cm_qpn = be32_to_cpu(cpi->redirect_qp) & 0x00ffffff; memcpy(ch->ib_cm.path.dgid.raw, cpi->redirect_gid, 16); ch->status = dlid ? SRP_DLID_REDIRECT : SRP_PORT_REDIRECT; break; case IB_CM_REJ_PORT_REDIRECT: if (srp_target_is_topspin(target)) { union ib_gid *dgid = &ch->ib_cm.path.dgid; /* * Topspin/Cisco SRP gateways incorrectly send * reject reason code 25 when they mean 24 * (port redirect). */ memcpy(dgid->raw, event->param.rej_rcvd.ari, 16); shost_printk(KERN_DEBUG, shost, PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", be64_to_cpu(dgid->global.subnet_prefix), be64_to_cpu(dgid->global.interface_id)); ch->status = SRP_PORT_REDIRECT; } else { shost_printk(KERN_WARNING, shost, " REJ reason: IB_CM_REJ_PORT_REDIRECT\n"); ch->status = -ECONNRESET; } break; case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID: shost_printk(KERN_WARNING, shost, " REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n"); ch->status = -ECONNRESET; break; case IB_CM_REJ_CONSUMER_DEFINED: opcode = *(u8 *) event->private_data; if (opcode == SRP_LOGIN_REJ) { struct srp_login_rej *rej = event->private_data; u32 reason = be32_to_cpu(rej->reason); if (reason == SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE) shost_printk(KERN_WARNING, shost, PFX "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); else shost_printk(KERN_WARNING, shost, PFX "SRP LOGIN from %pI6 to %pI6 REJECTED, reason 0x%08x\n", target->sgid.raw, target->ib_cm.orig_dgid.raw, reason); } else shost_printk(KERN_WARNING, shost, " REJ reason: IB_CM_REJ_CONSUMER_DEFINED," " opcode 0x%02x\n", opcode); ch->status = -ECONNRESET; break; case IB_CM_REJ_STALE_CONN: shost_printk(KERN_WARNING, shost, " REJ reason: stale connection\n"); ch->status = SRP_STALE_CONN; break; default: shost_printk(KERN_WARNING, shost, " REJ reason 0x%x\n", event->param.rej_rcvd.reason); ch->status = -ECONNRESET; } } static int srp_ib_cm_handler(struct ib_cm_id *cm_id, const struct ib_cm_event *event) { struct srp_rdma_ch *ch = cm_id->context; struct srp_target_port *target = ch->target; int comp = 0; switch (event->event) { case IB_CM_REQ_ERROR: shost_printk(KERN_DEBUG, target->scsi_host, PFX "Sending CM REQ failed\n"); comp = 1; ch->status = -ECONNRESET; break; case IB_CM_REP_RECEIVED: comp = 1; srp_cm_rep_handler(cm_id, event->private_data, ch); break; case IB_CM_REJ_RECEIVED: shost_printk(KERN_DEBUG, target->scsi_host, PFX "REJ received\n"); comp = 1; srp_ib_cm_rej_handler(cm_id, event, ch); break; case IB_CM_DREQ_RECEIVED: shost_printk(KERN_WARNING, target->scsi_host, PFX "DREQ received - connection closed\n"); ch->connected = false; if (ib_send_cm_drep(cm_id, NULL, 0)) shost_printk(KERN_ERR, target->scsi_host, PFX "Sending CM DREP failed\n"); queue_work(system_long_wq, &target->tl_err_work); break; case IB_CM_TIMEWAIT_EXIT: shost_printk(KERN_ERR, target->scsi_host, PFX "connection closed\n"); comp = 1; ch->status = 0; break; case IB_CM_MRA_RECEIVED: case IB_CM_DREQ_ERROR: case IB_CM_DREP_RECEIVED: break; default: shost_printk(KERN_WARNING, target->scsi_host, PFX "Unhandled CM event %d\n", event->event); break; } if (comp) complete(&ch->done); return 0; } static void srp_rdma_cm_rej_handler(struct srp_rdma_ch *ch, struct rdma_cm_event *event) { struct srp_target_port *target = ch->target; struct Scsi_Host *shost = target->scsi_host; int opcode; switch (event->status) { case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID: shost_printk(KERN_WARNING, shost, " REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n"); ch->status = -ECONNRESET; break; case IB_CM_REJ_CONSUMER_DEFINED: opcode = *(u8 *) event->param.conn.private_data; if (opcode == SRP_LOGIN_REJ) { struct srp_login_rej *rej = (struct srp_login_rej *) event->param.conn.private_data; u32 reason = be32_to_cpu(rej->reason); if (reason == SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE) shost_printk(KERN_WARNING, shost, PFX "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); else shost_printk(KERN_WARNING, shost, PFX "SRP LOGIN REJECTED, reason 0x%08x\n", reason); } else { shost_printk(KERN_WARNING, shost, " REJ reason: IB_CM_REJ_CONSUMER_DEFINED, opcode 0x%02x\n", opcode); } ch->status = -ECONNRESET; break; case IB_CM_REJ_STALE_CONN: shost_printk(KERN_WARNING, shost, " REJ reason: stale connection\n"); ch->status = SRP_STALE_CONN; break; default: shost_printk(KERN_WARNING, shost, " REJ reason 0x%x\n", event->status); ch->status = -ECONNRESET; break; } } static int srp_rdma_cm_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event) { struct srp_rdma_ch *ch = cm_id->context; struct srp_target_port *target = ch->target; int comp = 0; switch (event->event) { case RDMA_CM_EVENT_ADDR_RESOLVED: ch->status = 0; comp = 1; break; case RDMA_CM_EVENT_ADDR_ERROR: ch->status = -ENXIO; comp = 1; break; case RDMA_CM_EVENT_ROUTE_RESOLVED: ch->status = 0; comp = 1; break; case RDMA_CM_EVENT_ROUTE_ERROR: case RDMA_CM_EVENT_UNREACHABLE: ch->status = -EHOSTUNREACH; comp = 1; break; case RDMA_CM_EVENT_CONNECT_ERROR: shost_printk(KERN_DEBUG, target->scsi_host, PFX "Sending CM REQ failed\n"); comp = 1; ch->status = -ECONNRESET; break; case RDMA_CM_EVENT_ESTABLISHED: comp = 1; srp_cm_rep_handler(NULL, event->param.conn.private_data, ch); break; case RDMA_CM_EVENT_REJECTED: shost_printk(KERN_DEBUG, target->scsi_host, PFX "REJ received\n"); comp = 1; srp_rdma_cm_rej_handler(ch, event); break; case RDMA_CM_EVENT_DISCONNECTED: if (ch->connected) { shost_printk(KERN_WARNING, target->scsi_host, PFX "received DREQ\n"); rdma_disconnect(ch->rdma_cm.cm_id); comp = 1; ch->status = 0; queue_work(system_long_wq, &target->tl_err_work); } break; case RDMA_CM_EVENT_TIMEWAIT_EXIT: shost_printk(KERN_ERR, target->scsi_host, PFX "connection closed\n"); comp = 1; ch->status = 0; break; default: shost_printk(KERN_WARNING, target->scsi_host, PFX "Unhandled CM event %d\n", event->event); break; } if (comp) complete(&ch->done); return 0; } /** * srp_change_queue_depth - setting device queue depth * @sdev: scsi device struct * @qdepth: requested queue depth * * Returns queue depth. */ static int srp_change_queue_depth(struct scsi_device *sdev, int qdepth) { if (!sdev->tagged_supported) qdepth = 1; return scsi_change_queue_depth(sdev, qdepth); } static int srp_send_tsk_mgmt(struct srp_rdma_ch *ch, u64 req_tag, u64 lun, u8 func, u8 *status) { struct srp_target_port *target = ch->target; struct srp_rport *rport = target->rport; struct ib_device *dev = target->srp_host->srp_dev->dev; struct srp_iu *iu; struct srp_tsk_mgmt *tsk_mgmt; int res; if (!ch->connected || target->qp_in_error) return -1; /* * Lock the rport mutex to avoid that srp_create_ch_ib() is * invoked while a task management function is being sent. */ mutex_lock(&rport->mutex); spin_lock_irq(&ch->lock); iu = __srp_get_tx_iu(ch, SRP_IU_TSK_MGMT); spin_unlock_irq(&ch->lock); if (!iu) { mutex_unlock(&rport->mutex); return -1; } iu->num_sge = 1; ib_dma_sync_single_for_cpu(dev, iu->dma, sizeof *tsk_mgmt, DMA_TO_DEVICE); tsk_mgmt = iu->buf; memset(tsk_mgmt, 0, sizeof *tsk_mgmt); tsk_mgmt->opcode = SRP_TSK_MGMT; int_to_scsilun(lun, &tsk_mgmt->lun); tsk_mgmt->tsk_mgmt_func = func; tsk_mgmt->task_tag = req_tag; spin_lock_irq(&ch->lock); ch->tsk_mgmt_tag = (ch->tsk_mgmt_tag + 1) | SRP_TAG_TSK_MGMT; tsk_mgmt->tag = ch->tsk_mgmt_tag; spin_unlock_irq(&ch->lock); init_completion(&ch->tsk_mgmt_done); ib_dma_sync_single_for_device(dev, iu->dma, sizeof *tsk_mgmt, DMA_TO_DEVICE); if (srp_post_send(ch, iu, sizeof(*tsk_mgmt))) { srp_put_tx_iu(ch, iu, SRP_IU_TSK_MGMT); mutex_unlock(&rport->mutex); return -1; } res = wait_for_completion_timeout(&ch->tsk_mgmt_done, msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS)); if (res > 0 && status) *status = ch->tsk_mgmt_status; mutex_unlock(&rport->mutex); WARN_ON_ONCE(res < 0); return res > 0 ? 0 : -1; } static int srp_abort(struct scsi_cmnd *scmnd) { struct srp_target_port *target = host_to_target(scmnd->device->host); struct srp_request *req = scsi_cmd_priv(scmnd); u32 tag; u16 ch_idx; struct srp_rdma_ch *ch; shost_printk(KERN_ERR, target->scsi_host, "SRP abort called\n"); tag = blk_mq_unique_tag(scsi_cmd_to_rq(scmnd)); ch_idx = blk_mq_unique_tag_to_hwq(tag); if (WARN_ON_ONCE(ch_idx >= target->ch_count)) return SUCCESS; ch = &target->ch[ch_idx]; if (!srp_claim_req(ch, req, NULL, scmnd)) return SUCCESS; shost_printk(KERN_ERR, target->scsi_host, "Sending SRP abort for tag %#x\n", tag); if (srp_send_tsk_mgmt(ch, tag, scmnd->device->lun, SRP_TSK_ABORT_TASK, NULL) == 0) { srp_free_req(ch, req, scmnd, 0); return SUCCESS; } if (target->rport->state == SRP_RPORT_LOST) return FAST_IO_FAIL; return FAILED; } static int srp_reset_device(struct scsi_cmnd *scmnd) { struct srp_target_port *target = host_to_target(scmnd->device->host); struct srp_rdma_ch *ch; u8 status; shost_printk(KERN_ERR, target->scsi_host, "SRP reset_device called\n"); ch = &target->ch[0]; if (srp_send_tsk_mgmt(ch, SRP_TAG_NO_REQ, scmnd->device->lun, SRP_TSK_LUN_RESET, &status)) return FAILED; if (status) return FAILED; return SUCCESS; } static int srp_reset_host(struct scsi_cmnd *scmnd) { struct srp_target_port *target = host_to_target(scmnd->device->host); shost_printk(KERN_ERR, target->scsi_host, PFX "SRP reset_host called\n"); return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED; } static int srp_target_alloc(struct scsi_target *starget) { struct Scsi_Host *shost = dev_to_shost(starget->dev.parent); struct srp_target_port *target = host_to_target(shost); if (target->target_can_queue) starget->can_queue = target->target_can_queue; return 0; } static int srp_sdev_configure(struct scsi_device *sdev, struct queue_limits *lim) { struct Scsi_Host *shost = sdev->host; struct srp_target_port *target = host_to_target(shost); struct request_queue *q = sdev->request_queue; unsigned long timeout; if (sdev->type == TYPE_DISK) { timeout = max_t(unsigned, 30 * HZ, target->rq_tmo_jiffies); blk_queue_rq_timeout(q, timeout); } return 0; } static ssize_t id_ext_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "0x%016llx\n", be64_to_cpu(target->id_ext)); } static DEVICE_ATTR_RO(id_ext); static ssize_t ioc_guid_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "0x%016llx\n", be64_to_cpu(target->ioc_guid)); } static DEVICE_ATTR_RO(ioc_guid); static ssize_t service_id_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); if (target->using_rdma_cm) return -ENOENT; return sysfs_emit(buf, "0x%016llx\n", be64_to_cpu(target->ib_cm.service_id)); } static DEVICE_ATTR_RO(service_id); static ssize_t pkey_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); if (target->using_rdma_cm) return -ENOENT; return sysfs_emit(buf, "0x%04x\n", be16_to_cpu(target->ib_cm.pkey)); } static DEVICE_ATTR_RO(pkey); static ssize_t sgid_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%pI6\n", target->sgid.raw); } static DEVICE_ATTR_RO(sgid); static ssize_t dgid_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); struct srp_rdma_ch *ch = &target->ch[0]; if (target->using_rdma_cm) return -ENOENT; return sysfs_emit(buf, "%pI6\n", ch->ib_cm.path.dgid.raw); } static DEVICE_ATTR_RO(dgid); static ssize_t orig_dgid_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); if (target->using_rdma_cm) return -ENOENT; return sysfs_emit(buf, "%pI6\n", target->ib_cm.orig_dgid.raw); } static DEVICE_ATTR_RO(orig_dgid); static ssize_t req_lim_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); struct srp_rdma_ch *ch; int i, req_lim = INT_MAX; for (i = 0; i < target->ch_count; i++) { ch = &target->ch[i]; req_lim = min(req_lim, ch->req_lim); } return sysfs_emit(buf, "%d\n", req_lim); } static DEVICE_ATTR_RO(req_lim); static ssize_t zero_req_lim_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%d\n", target->zero_req_lim); } static DEVICE_ATTR_RO(zero_req_lim); static ssize_t local_ib_port_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%u\n", target->srp_host->port); } static DEVICE_ATTR_RO(local_ib_port); static ssize_t local_ib_device_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%s\n", dev_name(&target->srp_host->srp_dev->dev->dev)); } static DEVICE_ATTR_RO(local_ib_device); static ssize_t ch_count_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%d\n", target->ch_count); } static DEVICE_ATTR_RO(ch_count); static ssize_t comp_vector_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%d\n", target->comp_vector); } static DEVICE_ATTR_RO(comp_vector); static ssize_t tl_retry_count_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%d\n", target->tl_retry_count); } static DEVICE_ATTR_RO(tl_retry_count); static ssize_t cmd_sg_entries_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%u\n", target->cmd_sg_cnt); } static DEVICE_ATTR_RO(cmd_sg_entries); static ssize_t allow_ext_sg_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_target_port *target = host_to_target(class_to_shost(dev)); return sysfs_emit(buf, "%s\n", target->allow_ext_sg ? "true" : "false"); } static DEVICE_ATTR_RO(allow_ext_sg); static struct attribute *srp_host_attrs[] = { &dev_attr_id_ext.attr, &dev_attr_ioc_guid.attr, &dev_attr_service_id.attr, &dev_attr_pkey.attr, &dev_attr_sgid.attr, &dev_attr_dgid.attr, &dev_attr_orig_dgid.attr, &dev_attr_req_lim.attr, &dev_attr_zero_req_lim.attr, &dev_attr_local_ib_port.attr, &dev_attr_local_ib_device.attr, &dev_attr_ch_count.attr, &dev_attr_comp_vector.attr, &dev_attr_tl_retry_count.attr, &dev_attr_cmd_sg_entries.attr, &dev_attr_allow_ext_sg.attr, NULL }; ATTRIBUTE_GROUPS(srp_host); static const struct scsi_host_template srp_template = { .module = THIS_MODULE, .name = "InfiniBand SRP initiator", .proc_name = DRV_NAME, .target_alloc = srp_target_alloc, .sdev_configure = srp_sdev_configure, .info = srp_target_info, .init_cmd_priv = srp_init_cmd_priv, .exit_cmd_priv = srp_exit_cmd_priv, .queuecommand = srp_queuecommand, .change_queue_depth = srp_change_queue_depth, .eh_timed_out = srp_timed_out, .eh_abort_handler = srp_abort, .eh_device_reset_handler = srp_reset_device, .eh_host_reset_handler = srp_reset_host, .skip_settle_delay = true, .sg_tablesize = SRP_DEF_SG_TABLESIZE, .can_queue = SRP_DEFAULT_CMD_SQ_SIZE, .this_id = -1, .cmd_per_lun = SRP_DEFAULT_CMD_SQ_SIZE, .shost_groups = srp_host_groups, .track_queue_depth = 1, .cmd_size = sizeof(struct srp_request), }; static int srp_sdev_count(struct Scsi_Host *host) { struct scsi_device *sdev; int c = 0; shost_for_each_device(sdev, host) c++; return c; } /* * Return values: * < 0 upon failure. Caller is responsible for SRP target port cleanup. * 0 and target->state == SRP_TARGET_REMOVED if asynchronous target port * removal has been scheduled. * 0 and target->state != SRP_TARGET_REMOVED upon success. */ static int srp_add_target(struct srp_host *host, struct srp_target_port *target) { struct srp_rport_identifiers ids; struct srp_rport *rport; target->state = SRP_TARGET_SCANNING; sprintf(target->target_name, "SRP.T10:%016llX", be64_to_cpu(target->id_ext)); if (scsi_add_host(target->scsi_host, host->srp_dev->dev->dev.parent)) return -ENODEV; memcpy(ids.port_id, &target->id_ext, 8); memcpy(ids.port_id + 8, &target->ioc_guid, 8); ids.roles = SRP_RPORT_ROLE_TARGET; rport = srp_rport_add(target->scsi_host, &ids); if (IS_ERR(rport)) { scsi_remove_host(target->scsi_host); return PTR_ERR(rport); } rport->lld_data = target; target->rport = rport; spin_lock(&host->target_lock); list_add_tail(&target->list, &host->target_list); spin_unlock(&host->target_lock); scsi_scan_target(&target->scsi_host->shost_gendev, 0, target->scsi_id, SCAN_WILD_CARD, SCSI_SCAN_INITIAL); if (srp_connected_ch(target) < target->ch_count || target->qp_in_error) { shost_printk(KERN_INFO, target->scsi_host, PFX "SCSI scan failed - removing SCSI host\n"); srp_queue_remove_work(target); goto out; } pr_debug("%s: SCSI scan succeeded - detected %d LUNs\n", dev_name(&target->scsi_host->shost_gendev), srp_sdev_count(target->scsi_host)); spin_lock_irq(&target->lock); if (target->state == SRP_TARGET_SCANNING) target->state = SRP_TARGET_LIVE; spin_unlock_irq(&target->lock); out: return 0; } static void srp_release_dev(struct device *dev) { struct srp_host *host = container_of(dev, struct srp_host, dev); kfree(host); } static struct attribute *srp_class_attrs[]; ATTRIBUTE_GROUPS(srp_class); static struct class srp_class = { .name = "infiniband_srp", .dev_groups = srp_class_groups, .dev_release = srp_release_dev }; /** * srp_conn_unique() - check whether the connection to a target is unique * @host: SRP host. * @target: SRP target port. */ static bool srp_conn_unique(struct srp_host *host, struct srp_target_port *target) { struct srp_target_port *t; bool ret = false; if (target->state == SRP_TARGET_REMOVED) goto out; ret = true; spin_lock(&host->target_lock); list_for_each_entry(t, &host->target_list, list) { if (t != target && target->id_ext == t->id_ext && target->ioc_guid == t->ioc_guid && target->initiator_ext == t->initiator_ext) { ret = false; break; } } spin_unlock(&host->target_lock); out: return ret; } /* * Target ports are added by writing * * id_ext=<SRP ID ext>,ioc_guid=<SRP IOC GUID>,dgid=<dest GID>, * pkey=<P_Key>,service_id=<service ID> * or * id_ext=<SRP ID ext>,ioc_guid=<SRP IOC GUID>, * [src=<IPv4 address>,]dest=<IPv4 address>:<port number> * * to the add_target sysfs attribute. */ enum { SRP_OPT_ERR = 0, SRP_OPT_ID_EXT = 1 << 0, SRP_OPT_IOC_GUID = 1 << 1, SRP_OPT_DGID = 1 << 2, SRP_OPT_PKEY = 1 << 3, SRP_OPT_SERVICE_ID = 1 << 4, SRP_OPT_MAX_SECT = 1 << 5, SRP_OPT_MAX_CMD_PER_LUN = 1 << 6, SRP_OPT_IO_CLASS = 1 << 7, SRP_OPT_INITIATOR_EXT = 1 << 8, SRP_OPT_CMD_SG_ENTRIES = 1 << 9, SRP_OPT_ALLOW_EXT_SG = 1 << 10, SRP_OPT_SG_TABLESIZE = 1 << 11, SRP_OPT_COMP_VECTOR = 1 << 12, SRP_OPT_TL_RETRY_COUNT = 1 << 13, SRP_OPT_QUEUE_SIZE = 1 << 14, SRP_OPT_IP_SRC = 1 << 15, SRP_OPT_IP_DEST = 1 << 16, SRP_OPT_TARGET_CAN_QUEUE= 1 << 17, SRP_OPT_MAX_IT_IU_SIZE = 1 << 18, SRP_OPT_CH_COUNT = 1 << 19, }; static unsigned int srp_opt_mandatory[] = { SRP_OPT_ID_EXT | SRP_OPT_IOC_GUID | SRP_OPT_DGID | SRP_OPT_PKEY | SRP_OPT_SERVICE_ID, SRP_OPT_ID_EXT | SRP_OPT_IOC_GUID | SRP_OPT_IP_DEST, }; static const match_table_t srp_opt_tokens = { { SRP_OPT_ID_EXT, "id_ext=%s" }, { SRP_OPT_IOC_GUID, "ioc_guid=%s" }, { SRP_OPT_DGID, "dgid=%s" }, { SRP_OPT_PKEY, "pkey=%x" }, { SRP_OPT_SERVICE_ID, "service_id=%s" }, { SRP_OPT_MAX_SECT, "max_sect=%d" }, { SRP_OPT_MAX_CMD_PER_LUN, "max_cmd_per_lun=%d" }, { SRP_OPT_TARGET_CAN_QUEUE, "target_can_queue=%d" }, { SRP_OPT_IO_CLASS, "io_class=%x" }, { SRP_OPT_INITIATOR_EXT, "initiator_ext=%s" }, { SRP_OPT_CMD_SG_ENTRIES, "cmd_sg_entries=%u" }, { SRP_OPT_ALLOW_EXT_SG, "allow_ext_sg=%u" }, { SRP_OPT_SG_TABLESIZE, "sg_tablesize=%u" }, { SRP_OPT_COMP_VECTOR, "comp_vector=%u" }, { SRP_OPT_TL_RETRY_COUNT, "tl_retry_count=%u" }, { SRP_OPT_QUEUE_SIZE, "queue_size=%d" }, { SRP_OPT_IP_SRC, "src=%s" }, { SRP_OPT_IP_DEST, "dest=%s" }, { SRP_OPT_MAX_IT_IU_SIZE, "max_it_iu_size=%d" }, { SRP_OPT_CH_COUNT, "ch_count=%u", }, { SRP_OPT_ERR, NULL } }; /** * srp_parse_in - parse an IP address and port number combination * @net: [in] Network namespace. * @sa: [out] Address family, IP address and port number. * @addr_port_str: [in] IP address and port number. * @has_port: [out] Whether or not @addr_port_str includes a port number. * * Parse the following address formats: * - IPv4: <ip_address>:<port>, e.g. 1.2.3.4:5. * - IPv6: \[<ipv6_address>\]:<port>, e.g. [1::2:3%4]:5. */ static int srp_parse_in(struct net *net, struct sockaddr_storage *sa, const char *addr_port_str, bool *has_port) { char *addr_end, *addr = kstrdup(addr_port_str, GFP_KERNEL); char *port_str; int ret; if (!addr) return -ENOMEM; port_str = strrchr(addr, ':'); if (port_str && strchr(port_str, ']')) port_str = NULL; if (port_str) *port_str++ = '\0'; if (has_port) *has_port = port_str != NULL; ret = inet_pton_with_scope(net, AF_INET, addr, port_str, sa); if (ret && addr[0]) { addr_end = addr + strlen(addr) - 1; if (addr[0] == '[' && *addr_end == ']') { *addr_end = '\0'; ret = inet_pton_with_scope(net, AF_INET6, addr + 1, port_str, sa); } } kfree(addr); pr_debug("%s -> %pISpfsc\n", addr_port_str, sa); return ret; } static int srp_parse_options(struct net *net, const char *buf, struct srp_target_port *target) { char *options, *sep_opt; char *p; substring_t args[MAX_OPT_ARGS]; unsigned long long ull; bool has_port; int opt_mask = 0; int token; int ret = -EINVAL; int i; options = kstrdup(buf, GFP_KERNEL); if (!options) return -ENOMEM; sep_opt = options; while ((p = strsep(&sep_opt, ",\n")) != NULL) { if (!*p) continue; token = match_token(p, srp_opt_tokens, args); opt_mask |= token; switch (token) { case SRP_OPT_ID_EXT: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = kstrtoull(p, 16, &ull); if (ret) { pr_warn("invalid id_ext parameter '%s'\n", p); kfree(p); goto out; } target->id_ext = cpu_to_be64(ull); kfree(p); break; case SRP_OPT_IOC_GUID: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = kstrtoull(p, 16, &ull); if (ret) { pr_warn("invalid ioc_guid parameter '%s'\n", p); kfree(p); goto out; } target->ioc_guid = cpu_to_be64(ull); kfree(p); break; case SRP_OPT_DGID: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } if (strlen(p) != 32) { pr_warn("bad dest GID parameter '%s'\n", p); kfree(p); goto out; } ret = hex2bin(target->ib_cm.orig_dgid.raw, p, 16); kfree(p); if (ret < 0) goto out; break; case SRP_OPT_PKEY: ret = match_hex(args, &token); if (ret) { pr_warn("bad P_Key parameter '%s'\n", p); goto out; } target->ib_cm.pkey = cpu_to_be16(token); break; case SRP_OPT_SERVICE_ID: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = kstrtoull(p, 16, &ull); if (ret) { pr_warn("bad service_id parameter '%s'\n", p); kfree(p); goto out; } target->ib_cm.service_id = cpu_to_be64(ull); kfree(p); break; case SRP_OPT_IP_SRC: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = srp_parse_in(net, &target->rdma_cm.src.ss, p, NULL); if (ret < 0) { pr_warn("bad source parameter '%s'\n", p); kfree(p); goto out; } target->rdma_cm.src_specified = true; kfree(p); break; case SRP_OPT_IP_DEST: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = srp_parse_in(net, &target->rdma_cm.dst.ss, p, &has_port); if (!has_port) ret = -EINVAL; if (ret < 0) { pr_warn("bad dest parameter '%s'\n", p); kfree(p); goto out; } target->using_rdma_cm = true; kfree(p); break; case SRP_OPT_MAX_SECT: ret = match_int(args, &token); if (ret) { pr_warn("bad max sect parameter '%s'\n", p); goto out; } target->scsi_host->max_sectors = token; break; case SRP_OPT_QUEUE_SIZE: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for queue_size parameter '%s', Error %d\n", p, ret); goto out; } if (token < 1) { pr_warn("bad queue_size parameter '%s'\n", p); ret = -EINVAL; goto out; } target->scsi_host->can_queue = token; target->queue_size = token + SRP_RSP_SQ_SIZE + SRP_TSK_MGMT_SQ_SIZE; if (!(opt_mask & SRP_OPT_MAX_CMD_PER_LUN)) target->scsi_host->cmd_per_lun = token; break; case SRP_OPT_MAX_CMD_PER_LUN: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for max cmd_per_lun parameter '%s', Error %d\n", p, ret); goto out; } if (token < 1) { pr_warn("bad max cmd_per_lun parameter '%s'\n", p); ret = -EINVAL; goto out; } target->scsi_host->cmd_per_lun = token; break; case SRP_OPT_TARGET_CAN_QUEUE: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for max target_can_queue parameter '%s', Error %d\n", p, ret); goto out; } if (token < 1) { pr_warn("bad max target_can_queue parameter '%s'\n", p); ret = -EINVAL; goto out; } target->target_can_queue = token; break; case SRP_OPT_IO_CLASS: ret = match_hex(args, &token); if (ret) { pr_warn("bad IO class parameter '%s'\n", p); goto out; } if (token != SRP_REV10_IB_IO_CLASS && token != SRP_REV16A_IB_IO_CLASS) { pr_warn("unknown IO class parameter value %x specified (use %x or %x).\n", token, SRP_REV10_IB_IO_CLASS, SRP_REV16A_IB_IO_CLASS); ret = -EINVAL; goto out; } target->io_class = token; break; case SRP_OPT_INITIATOR_EXT: p = match_strdup(args); if (!p) { ret = -ENOMEM; goto out; } ret = kstrtoull(p, 16, &ull); if (ret) { pr_warn("bad initiator_ext value '%s'\n", p); kfree(p); goto out; } target->initiator_ext = cpu_to_be64(ull); kfree(p); break; case SRP_OPT_CMD_SG_ENTRIES: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for max cmd_sg_entries parameter '%s', Error %d\n", p, ret); goto out; } if (token < 1 || token > 255) { pr_warn("bad max cmd_sg_entries parameter '%s'\n", p); ret = -EINVAL; goto out; } target->cmd_sg_cnt = token; break; case SRP_OPT_ALLOW_EXT_SG: ret = match_int(args, &token); if (ret) { pr_warn("bad allow_ext_sg parameter '%s'\n", p); goto out; } target->allow_ext_sg = !!token; break; case SRP_OPT_SG_TABLESIZE: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for max sg_tablesize parameter '%s', Error %d\n", p, ret); goto out; } if (token < 1 || token > SG_MAX_SEGMENTS) { pr_warn("bad max sg_tablesize parameter '%s'\n", p); ret = -EINVAL; goto out; } target->sg_tablesize = token; break; case SRP_OPT_COMP_VECTOR: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for comp_vector parameter '%s', Error %d\n", p, ret); goto out; } if (token < 0) { pr_warn("bad comp_vector parameter '%s'\n", p); ret = -EINVAL; goto out; } target->comp_vector = token; break; case SRP_OPT_TL_RETRY_COUNT: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for tl_retry_count parameter '%s', Error %d\n", p, ret); goto out; } if (token < 2 || token > 7) { pr_warn("bad tl_retry_count parameter '%s' (must be a number between 2 and 7)\n", p); ret = -EINVAL; goto out; } target->tl_retry_count = token; break; case SRP_OPT_MAX_IT_IU_SIZE: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for max it_iu_size parameter '%s', Error %d\n", p, ret); goto out; } if (token < 0) { pr_warn("bad maximum initiator to target IU size '%s'\n", p); ret = -EINVAL; goto out; } target->max_it_iu_size = token; break; case SRP_OPT_CH_COUNT: ret = match_int(args, &token); if (ret) { pr_warn("match_int() failed for channel count parameter '%s', Error %d\n", p, ret); goto out; } if (token < 1) { pr_warn("bad channel count %s\n", p); ret = -EINVAL; goto out; } target->ch_count = token; break; default: pr_warn("unknown parameter or missing value '%s' in target creation request\n", p); ret = -EINVAL; goto out; } } for (i = 0; i < ARRAY_SIZE(srp_opt_mandatory); i++) { if ((opt_mask & srp_opt_mandatory[i]) == srp_opt_mandatory[i]) { ret = 0; break; } } if (ret) pr_warn("target creation request is missing one or more parameters\n"); if (target->scsi_host->cmd_per_lun > target->scsi_host->can_queue && (opt_mask & SRP_OPT_MAX_CMD_PER_LUN)) pr_warn("cmd_per_lun = %d > queue_size = %d\n", target->scsi_host->cmd_per_lun, target->scsi_host->can_queue); out: kfree(options); return ret; } static ssize_t add_target_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct srp_host *host = container_of(dev, struct srp_host, dev); struct Scsi_Host *target_host; struct srp_target_port *target; struct srp_rdma_ch *ch; struct srp_device *srp_dev = host->srp_dev; struct ib_device *ibdev = srp_dev->dev; int ret, i, ch_idx; unsigned int max_sectors_per_mr, mr_per_cmd = 0; bool multich = false; uint32_t max_iu_len; target_host = scsi_host_alloc(&srp_template, sizeof (struct srp_target_port)); if (!target_host) return -ENOMEM; target_host->transportt = ib_srp_transport_template; target_host->max_channel = 0; target_host->max_id = 1; target_host->max_lun = -1LL; target_host->max_cmd_len = sizeof ((struct srp_cmd *) (void *) 0L)->cdb; if (ibdev->attrs.kernel_cap_flags & IBK_SG_GAPS_REG) target_host->max_segment_size = ib_dma_max_seg_size(ibdev); else target_host->virt_boundary_mask = ~srp_dev->mr_page_mask; target = host_to_target(target_host); target->net = kobj_ns_grab_current(KOBJ_NS_TYPE_NET); target->io_class = SRP_REV16A_IB_IO_CLASS; target->scsi_host = target_host; target->srp_host = host; target->lkey = host->srp_dev->pd->local_dma_lkey; target->global_rkey = host->srp_dev->global_rkey; target->cmd_sg_cnt = cmd_sg_entries; target->sg_tablesize = indirect_sg_entries ? : cmd_sg_entries; target->allow_ext_sg = allow_ext_sg; target->tl_retry_count = 7; target->queue_size = SRP_DEFAULT_QUEUE_SIZE; /* * Avoid that the SCSI host can be removed by srp_remove_target() * before this function returns. */ scsi_host_get(target->scsi_host); ret = mutex_lock_interruptible(&host->add_target_mutex); if (ret < 0) goto put; ret = srp_parse_options(target->net, buf, target); if (ret) goto out; if (!srp_conn_unique(target->srp_host, target)) { if (target->using_rdma_cm) { shost_printk(KERN_INFO, target->scsi_host, PFX "Already connected to target port with id_ext=%016llx;ioc_guid=%016llx;dest=%pIS\n", be64_to_cpu(target->id_ext), be64_to_cpu(target->ioc_guid), &target->rdma_cm.dst); } else { shost_printk(KERN_INFO, target->scsi_host, PFX "Already connected to target port with id_ext=%016llx;ioc_guid=%016llx;initiator_ext=%016llx\n", be64_to_cpu(target->id_ext), be64_to_cpu(target->ioc_guid), be64_to_cpu(target->initiator_ext)); } ret = -EEXIST; goto out; } if (!srp_dev->has_fr && !target->allow_ext_sg && target->cmd_sg_cnt < target->sg_tablesize) { pr_warn("No MR pool and no external indirect descriptors, limiting sg_tablesize to cmd_sg_cnt\n"); target->sg_tablesize = target->cmd_sg_cnt; } if (srp_dev->use_fast_reg) { bool gaps_reg = ibdev->attrs.kernel_cap_flags & IBK_SG_GAPS_REG; max_sectors_per_mr = srp_dev->max_pages_per_mr << (ilog2(srp_dev->mr_page_size) - 9); if (!gaps_reg) { /* * FR can only map one HCA page per entry. If the start * address is not aligned on a HCA page boundary two * entries will be used for the head and the tail * although these two entries combined contain at most * one HCA page of data. Hence the "+ 1" in the * calculation below. * * The indirect data buffer descriptor is contiguous * so the memory for that buffer will only be * registered if register_always is true. Hence add * one to mr_per_cmd if register_always has been set. */ mr_per_cmd = register_always + (target->scsi_host->max_sectors + 1 + max_sectors_per_mr - 1) / max_sectors_per_mr; } else { mr_per_cmd = register_always + (target->sg_tablesize + srp_dev->max_pages_per_mr - 1) / srp_dev->max_pages_per_mr; } pr_debug("max_sectors = %u; max_pages_per_mr = %u; mr_page_size = %u; max_sectors_per_mr = %u; mr_per_cmd = %u\n", target->scsi_host->max_sectors, srp_dev->max_pages_per_mr, srp_dev->mr_page_size, max_sectors_per_mr, mr_per_cmd); } target_host->sg_tablesize = target->sg_tablesize; target->mr_pool_size = target->scsi_host->can_queue * mr_per_cmd; target->mr_per_cmd = mr_per_cmd; target->indirect_size = target->sg_tablesize * sizeof (struct srp_direct_buf); max_iu_len = srp_max_it_iu_len(target->cmd_sg_cnt, srp_use_imm_data, target->max_it_iu_size); INIT_WORK(&target->tl_err_work, srp_tl_err_work); INIT_WORK(&target->remove_work, srp_remove_work); spin_lock_init(&target->lock); ret = rdma_query_gid(ibdev, host->port, 0, &target->sgid); if (ret) goto out; ret = -ENOMEM; if (target->ch_count == 0) { target->ch_count = min(ch_count ?: max(4 * num_online_nodes(), ibdev->num_comp_vectors), num_online_cpus()); } target->ch = kcalloc(target->ch_count, sizeof(*target->ch), GFP_KERNEL); if (!target->ch) goto out; for (ch_idx = 0; ch_idx < target->ch_count; ++ch_idx) { ch = &target->ch[ch_idx]; ch->target = target; ch->comp_vector = ch_idx % ibdev->num_comp_vectors; spin_lock_init(&ch->lock); INIT_LIST_HEAD(&ch->free_tx); ret = srp_new_cm_id(ch); if (ret) goto err_disconnect; ret = srp_create_ch_ib(ch); if (ret) goto err_disconnect; ret = srp_connect_ch(ch, max_iu_len, multich); if (ret) { char dst[64]; if (target->using_rdma_cm) snprintf(dst, sizeof(dst), "%pIS", &target->rdma_cm.dst); else snprintf(dst, sizeof(dst), "%pI6", target->ib_cm.orig_dgid.raw); shost_printk(KERN_ERR, target->scsi_host, PFX "Connection %d/%d to %s failed\n", ch_idx, target->ch_count, dst); if (ch_idx == 0) { goto free_ch; } else { srp_free_ch_ib(target, ch); target->ch_count = ch - target->ch; goto connected; } } multich = true; } connected: target->scsi_host->nr_hw_queues = target->ch_count; ret = srp_add_target(host, target); if (ret) goto err_disconnect; if (target->state != SRP_TARGET_REMOVED) { if (target->using_rdma_cm) { shost_printk(KERN_DEBUG, target->scsi_host, PFX "new target: id_ext %016llx ioc_guid %016llx sgid %pI6 dest %pIS\n", be64_to_cpu(target->id_ext), be64_to_cpu(target->ioc_guid), target->sgid.raw, &target->rdma_cm.dst); } else { shost_printk(KERN_DEBUG, target->scsi_host, PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x service_id %016llx sgid %pI6 dgid %pI6\n", be64_to_cpu(target->id_ext), be64_to_cpu(target->ioc_guid), be16_to_cpu(target->ib_cm.pkey), be64_to_cpu(target->ib_cm.service_id), target->sgid.raw, target->ib_cm.orig_dgid.raw); } } ret = count; out: mutex_unlock(&host->add_target_mutex); put: scsi_host_put(target->scsi_host); if (ret < 0) { /* * If a call to srp_remove_target() has not been scheduled, * drop the network namespace reference now that was obtained * earlier in this function. */ if (target->state != SRP_TARGET_REMOVED) kobj_ns_drop(KOBJ_NS_TYPE_NET, target->net); scsi_host_put(target->scsi_host); } return ret; err_disconnect: srp_disconnect_target(target); free_ch: for (i = 0; i < target->ch_count; i++) { ch = &target->ch[i]; srp_free_ch_ib(target, ch); } kfree(target->ch); goto out; } static DEVICE_ATTR_WO(add_target); static ssize_t ibdev_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_host *host = container_of(dev, struct srp_host, dev); return sysfs_emit(buf, "%s\n", dev_name(&host->srp_dev->dev->dev)); } static DEVICE_ATTR_RO(ibdev); static ssize_t port_show(struct device *dev, struct device_attribute *attr, char *buf) { struct srp_host *host = container_of(dev, struct srp_host, dev); return sysfs_emit(buf, "%u\n", host->port); } static DEVICE_ATTR_RO(port); static struct attribute *srp_class_attrs[] = { &dev_attr_add_target.attr, &dev_attr_ibdev.attr, &dev_attr_port.attr, NULL }; static struct srp_host *srp_add_port(struct srp_device *device, u32 port) { struct srp_host *host; host = kzalloc(sizeof *host, GFP_KERNEL); if (!host) return NULL; INIT_LIST_HEAD(&host->target_list); spin_lock_init(&host->target_lock); mutex_init(&host->add_target_mutex); host->srp_dev = device; host->port = port; device_initialize(&host->dev); host->dev.class = &srp_class; host->dev.parent = device->dev->dev.parent; if (dev_set_name(&host->dev, "srp-%s-%u", dev_name(&device->dev->dev), port)) goto put_host; if (device_add(&host->dev)) goto put_host; return host; put_host: put_device(&host->dev); return NULL; } static void srp_rename_dev(struct ib_device *device, void *client_data) { struct srp_device *srp_dev = client_data; struct srp_host *host, *tmp_host; list_for_each_entry_safe(host, tmp_host, &srp_dev->dev_list, list) { char name[IB_DEVICE_NAME_MAX + 8]; snprintf(name, sizeof(name), "srp-%s-%u", dev_name(&device->dev), host->port); device_rename(&host->dev, name); } } static int srp_add_one(struct ib_device *device) { struct srp_device *srp_dev; struct ib_device_attr *attr = &device->attrs; struct srp_host *host; int mr_page_shift; u32 p; u64 max_pages_per_mr; unsigned int flags = 0; srp_dev = kzalloc(sizeof(*srp_dev), GFP_KERNEL); if (!srp_dev) return -ENOMEM; /* * Use the smallest page size supported by the HCA, down to a * minimum of 4096 bytes. We're unlikely to build large sglists * out of smaller entries. */ mr_page_shift = max(12, ffs(attr->page_size_cap) - 1); srp_dev->mr_page_size = 1 << mr_page_shift; srp_dev->mr_page_mask = ~((u64) srp_dev->mr_page_size - 1); max_pages_per_mr = attr->max_mr_size; do_div(max_pages_per_mr, srp_dev->mr_page_size); pr_debug("%s: %llu / %u = %llu <> %u\n", __func__, attr->max_mr_size, srp_dev->mr_page_size, max_pages_per_mr, SRP_MAX_PAGES_PER_MR); srp_dev->max_pages_per_mr = min_t(u64, SRP_MAX_PAGES_PER_MR, max_pages_per_mr); srp_dev->has_fr = (attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS); if (!never_register && !srp_dev->has_fr) dev_warn(&device->dev, "FR is not supported\n"); else if (!never_register && attr->max_mr_size >= 2 * srp_dev->mr_page_size) srp_dev->use_fast_reg = srp_dev->has_fr; if (never_register || !register_always || !srp_dev->has_fr) flags |= IB_PD_UNSAFE_GLOBAL_RKEY; if (srp_dev->use_fast_reg) { srp_dev->max_pages_per_mr = min_t(u32, srp_dev->max_pages_per_mr, attr->max_fast_reg_page_list_len); } srp_dev->mr_max_size = srp_dev->mr_page_size * srp_dev->max_pages_per_mr; pr_debug("%s: mr_page_shift = %d, device->max_mr_size = %#llx, device->max_fast_reg_page_list_len = %u, max_pages_per_mr = %d, mr_max_size = %#x\n", dev_name(&device->dev), mr_page_shift, attr->max_mr_size, attr->max_fast_reg_page_list_len, srp_dev->max_pages_per_mr, srp_dev->mr_max_size); INIT_LIST_HEAD(&srp_dev->dev_list); srp_dev->dev = device; srp_dev->pd = ib_alloc_pd(device, flags); if (IS_ERR(srp_dev->pd)) { int ret = PTR_ERR(srp_dev->pd); kfree(srp_dev); return ret; } if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) { srp_dev->global_rkey = srp_dev->pd->unsafe_global_rkey; WARN_ON_ONCE(srp_dev->global_rkey == 0); } rdma_for_each_port (device, p) { host = srp_add_port(srp_dev, p); if (host) list_add_tail(&host->list, &srp_dev->dev_list); } ib_set_client_data(device, &srp_client, srp_dev); return 0; } static void srp_remove_one(struct ib_device *device, void *client_data) { struct srp_device *srp_dev; struct srp_host *host, *tmp_host; struct srp_target_port *target; srp_dev = client_data; list_for_each_entry_safe(host, tmp_host, &srp_dev->dev_list, list) { /* * Remove the add_target sysfs entry so that no new target ports * can be created. */ device_del(&host->dev); /* * Remove all target ports. */ spin_lock(&host->target_lock); list_for_each_entry(target, &host->target_list, list) srp_queue_remove_work(target); spin_unlock(&host->target_lock); /* * srp_queue_remove_work() queues a call to * srp_remove_target(). The latter function cancels * target->tl_err_work so waiting for the remove works to * finish is sufficient. */ flush_workqueue(srp_remove_wq); put_device(&host->dev); } ib_dealloc_pd(srp_dev->pd); kfree(srp_dev); } static struct srp_function_template ib_srp_transport_functions = { .has_rport_state = true, .reset_timer_if_blocked = true, .reconnect_delay = &srp_reconnect_delay, .fast_io_fail_tmo = &srp_fast_io_fail_tmo, .dev_loss_tmo = &srp_dev_loss_tmo, .reconnect = srp_rport_reconnect, .rport_delete = srp_rport_delete, .terminate_rport_io = srp_terminate_io, }; static int __init srp_init_module(void) { int ret; BUILD_BUG_ON(sizeof(struct srp_aer_req) != 36); BUILD_BUG_ON(sizeof(struct srp_cmd) != 48); BUILD_BUG_ON(sizeof(struct srp_imm_buf) != 4); BUILD_BUG_ON(sizeof(struct srp_indirect_buf) != 20); BUILD_BUG_ON(sizeof(struct srp_login_req) != 64); BUILD_BUG_ON(sizeof(struct srp_login_req_rdma) != 56); BUILD_BUG_ON(sizeof(struct srp_rsp) != 36); if (srp_sg_tablesize) { pr_warn("srp_sg_tablesize is deprecated, please use cmd_sg_entries\n"); if (!cmd_sg_entries) cmd_sg_entries = srp_sg_tablesize; } if (!cmd_sg_entries) cmd_sg_entries = SRP_DEF_SG_TABLESIZE; if (cmd_sg_entries > 255) { pr_warn("Clamping cmd_sg_entries to 255\n"); cmd_sg_entries = 255; } if (!indirect_sg_entries) indirect_sg_entries = cmd_sg_entries; else if (indirect_sg_entries < cmd_sg_entries) { pr_warn("Bumping up indirect_sg_entries to match cmd_sg_entries (%u)\n", cmd_sg_entries); indirect_sg_entries = cmd_sg_entries; } if (indirect_sg_entries > SG_MAX_SEGMENTS) { pr_warn("Clamping indirect_sg_entries to %u\n", SG_MAX_SEGMENTS); indirect_sg_entries = SG_MAX_SEGMENTS; } srp_remove_wq = create_workqueue("srp_remove"); if (!srp_remove_wq) { ret = -ENOMEM; goto out; } ret = -ENOMEM; ib_srp_transport_template = srp_attach_transport(&ib_srp_transport_functions); if (!ib_srp_transport_template) goto destroy_wq; ret = class_register(&srp_class); if (ret) { pr_err("couldn't register class infiniband_srp\n"); goto release_tr; } ib_sa_register_client(&srp_sa_client); ret = ib_register_client(&srp_client); if (ret) { pr_err("couldn't register IB client\n"); goto unreg_sa; } out: return ret; unreg_sa: ib_sa_unregister_client(&srp_sa_client); class_unregister(&srp_class); release_tr: srp_release_transport(ib_srp_transport_template); destroy_wq: destroy_workqueue(srp_remove_wq); goto out; } static void __exit srp_cleanup_module(void) { ib_unregister_client(&srp_client); ib_sa_unregister_client(&srp_sa_client); class_unregister(&srp_class); srp_release_transport(ib_srp_transport_template); destroy_workqueue(srp_remove_wq); } module_init(srp_init_module); module_exit(srp_cleanup_module); |
| 7 1 7 24 24 24 2 2 2 74 74 2 2 2 3 12 1 11 29 10 10 20 20 1 22 5 20 1 19 19 7 17 35 35 25 29 23 20 1 20 4 3 16 1 1 22 6 17 26 6 65 65 65 65 6 62 5 4 7 55 57 57 57 23 19 2 2 2 1 55 55 2 48 48 50 23 7 6 27 17 2 27 1 9 30 4 1 5 20 25 2 2 1 1 5 2 2 2 2 1 1 2 1 1 7 1 1 5 1 1 2 1 1 23 22 3 23 9 3 5 5 2 3 9 4 4 4 4 1 3 4 4 1 1 1 2 24 1 13 1 4 1 1 1 4 71 71 70 71 73 1 72 73 2 71 73 73 74 1 73 40 34 73 1 73 73 22 12 10 2 19 328 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 | // SPDX-License-Identifier: GPL-2.0-only /* * Kernel Connection Multiplexor * * Copyright (c) 2016 Tom Herbert <tom@herbertland.com> */ #include <linux/bpf.h> #include <linux/errno.h> #include <linux/errqueue.h> #include <linux/file.h> #include <linux/filter.h> #include <linux/in.h> #include <linux/kernel.h> #include <linux/module.h> #include <linux/net.h> #include <linux/netdevice.h> #include <linux/poll.h> #include <linux/rculist.h> #include <linux/skbuff.h> #include <linux/socket.h> #include <linux/splice.h> #include <linux/uaccess.h> #include <linux/workqueue.h> #include <linux/syscalls.h> #include <linux/sched/signal.h> #include <net/kcm.h> #include <net/netns/generic.h> #include <net/sock.h> #include <uapi/linux/kcm.h> #include <trace/events/sock.h> unsigned int kcm_net_id; static struct kmem_cache *kcm_psockp __read_mostly; static struct kmem_cache *kcm_muxp __read_mostly; static struct workqueue_struct *kcm_wq; static inline struct kcm_sock *kcm_sk(const struct sock *sk) { return (struct kcm_sock *)sk; } static inline struct kcm_tx_msg *kcm_tx_msg(struct sk_buff *skb) { return (struct kcm_tx_msg *)skb->cb; } static void report_csk_error(struct sock *csk, int err) { csk->sk_err = EPIPE; sk_error_report(csk); } static void kcm_abort_tx_psock(struct kcm_psock *psock, int err, bool wakeup_kcm) { struct sock *csk = psock->sk; struct kcm_mux *mux = psock->mux; /* Unrecoverable error in transmit */ spin_lock_bh(&mux->lock); if (psock->tx_stopped) { spin_unlock_bh(&mux->lock); return; } psock->tx_stopped = 1; KCM_STATS_INCR(psock->stats.tx_aborts); if (!psock->tx_kcm) { /* Take off psocks_avail list */ list_del(&psock->psock_avail_list); } else if (wakeup_kcm) { /* In this case psock is being aborted while outside of * write_msgs and psock is reserved. Schedule tx_work * to handle the failure there. Need to commit tx_stopped * before queuing work. */ smp_mb(); queue_work(kcm_wq, &psock->tx_kcm->tx_work); } spin_unlock_bh(&mux->lock); /* Report error on lower socket */ report_csk_error(csk, err); } /* RX mux lock held. */ static void kcm_update_rx_mux_stats(struct kcm_mux *mux, struct kcm_psock *psock) { STRP_STATS_ADD(mux->stats.rx_bytes, psock->strp.stats.bytes - psock->saved_rx_bytes); mux->stats.rx_msgs += psock->strp.stats.msgs - psock->saved_rx_msgs; psock->saved_rx_msgs = psock->strp.stats.msgs; psock->saved_rx_bytes = psock->strp.stats.bytes; } static void kcm_update_tx_mux_stats(struct kcm_mux *mux, struct kcm_psock *psock) { KCM_STATS_ADD(mux->stats.tx_bytes, psock->stats.tx_bytes - psock->saved_tx_bytes); mux->stats.tx_msgs += psock->stats.tx_msgs - psock->saved_tx_msgs; psock->saved_tx_msgs = psock->stats.tx_msgs; psock->saved_tx_bytes = psock->stats.tx_bytes; } static int kcm_queue_rcv_skb(struct sock *sk, struct sk_buff *skb); /* KCM is ready to receive messages on its queue-- either the KCM is new or * has become unblocked after being blocked on full socket buffer. Queue any * pending ready messages on a psock. RX mux lock held. */ static void kcm_rcv_ready(struct kcm_sock *kcm) { struct kcm_mux *mux = kcm->mux; struct kcm_psock *psock; struct sk_buff *skb; if (unlikely(kcm->rx_wait || kcm->rx_psock || kcm->rx_disabled)) return; while (unlikely((skb = __skb_dequeue(&mux->rx_hold_queue)))) { if (kcm_queue_rcv_skb(&kcm->sk, skb)) { /* Assuming buffer limit has been reached */ skb_queue_head(&mux->rx_hold_queue, skb); WARN_ON(!sk_rmem_alloc_get(&kcm->sk)); return; } } while (!list_empty(&mux->psocks_ready)) { psock = list_first_entry(&mux->psocks_ready, struct kcm_psock, psock_ready_list); if (kcm_queue_rcv_skb(&kcm->sk, psock->ready_rx_msg)) { /* Assuming buffer limit has been reached */ WARN_ON(!sk_rmem_alloc_get(&kcm->sk)); return; } /* Consumed the ready message on the psock. Schedule rx_work to * get more messages. */ list_del(&psock->psock_ready_list); psock->ready_rx_msg = NULL; /* Commit clearing of ready_rx_msg for queuing work */ smp_mb(); strp_unpause(&psock->strp); strp_check_rcv(&psock->strp); } /* Buffer limit is okay now, add to ready list */ list_add_tail(&kcm->wait_rx_list, &kcm->mux->kcm_rx_waiters); /* paired with lockless reads in kcm_rfree() */ WRITE_ONCE(kcm->rx_wait, true); } static void kcm_rfree(struct sk_buff *skb) { struct sock *sk = skb->sk; struct kcm_sock *kcm = kcm_sk(sk); struct kcm_mux *mux = kcm->mux; unsigned int len = skb->truesize; sk_mem_uncharge(sk, len); atomic_sub(len, &sk->sk_rmem_alloc); /* For reading rx_wait and rx_psock without holding lock */ smp_mb__after_atomic(); if (!READ_ONCE(kcm->rx_wait) && !READ_ONCE(kcm->rx_psock) && sk_rmem_alloc_get(sk) < sk->sk_rcvlowat) { spin_lock_bh(&mux->rx_lock); kcm_rcv_ready(kcm); spin_unlock_bh(&mux->rx_lock); } } static int kcm_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) { struct sk_buff_head *list = &sk->sk_receive_queue; if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) return -ENOMEM; if (!sk_rmem_schedule(sk, skb, skb->truesize)) return -ENOBUFS; skb->dev = NULL; skb_orphan(skb); skb->sk = sk; skb->destructor = kcm_rfree; atomic_add(skb->truesize, &sk->sk_rmem_alloc); sk_mem_charge(sk, skb->truesize); skb_queue_tail(list, skb); if (!sock_flag(sk, SOCK_DEAD)) sk->sk_data_ready(sk); return 0; } /* Requeue received messages for a kcm socket to other kcm sockets. This is * called with a kcm socket is receive disabled. * RX mux lock held. */ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head) { struct sk_buff *skb; struct kcm_sock *kcm; while ((skb = skb_dequeue(head))) { /* Reset destructor to avoid calling kcm_rcv_ready */ skb->destructor = sock_rfree; skb_orphan(skb); try_again: if (list_empty(&mux->kcm_rx_waiters)) { skb_queue_tail(&mux->rx_hold_queue, skb); continue; } kcm = list_first_entry(&mux->kcm_rx_waiters, struct kcm_sock, wait_rx_list); if (kcm_queue_rcv_skb(&kcm->sk, skb)) { /* Should mean socket buffer full */ list_del(&kcm->wait_rx_list); /* paired with lockless reads in kcm_rfree() */ WRITE_ONCE(kcm->rx_wait, false); /* Commit rx_wait to read in kcm_free */ smp_wmb(); goto try_again; } } } /* Lower sock lock held */ static struct kcm_sock *reserve_rx_kcm(struct kcm_psock *psock, struct sk_buff *head) { struct kcm_mux *mux = psock->mux; struct kcm_sock *kcm; WARN_ON(psock->ready_rx_msg); if (psock->rx_kcm) return psock->rx_kcm; spin_lock_bh(&mux->rx_lock); if (psock->rx_kcm) { spin_unlock_bh(&mux->rx_lock); return psock->rx_kcm; } kcm_update_rx_mux_stats(mux, psock); if (list_empty(&mux->kcm_rx_waiters)) { psock->ready_rx_msg = head; strp_pause(&psock->strp); list_add_tail(&psock->psock_ready_list, &mux->psocks_ready); spin_unlock_bh(&mux->rx_lock); return NULL; } kcm = list_first_entry(&mux->kcm_rx_waiters, struct kcm_sock, wait_rx_list); list_del(&kcm->wait_rx_list); /* paired with lockless reads in kcm_rfree() */ WRITE_ONCE(kcm->rx_wait, false); psock->rx_kcm = kcm; /* paired with lockless reads in kcm_rfree() */ WRITE_ONCE(kcm->rx_psock, psock); spin_unlock_bh(&mux->rx_lock); return kcm; } static void kcm_done(struct kcm_sock *kcm); static void kcm_done_work(struct work_struct *w) { kcm_done(container_of(w, struct kcm_sock, done_work)); } /* Lower sock held */ static void unreserve_rx_kcm(struct kcm_psock *psock, bool rcv_ready) { struct kcm_sock *kcm = psock->rx_kcm; struct kcm_mux *mux = psock->mux; if (!kcm) return; spin_lock_bh(&mux->rx_lock); psock->rx_kcm = NULL; /* paired with lockless reads in kcm_rfree() */ WRITE_ONCE(kcm->rx_psock, NULL); /* Commit kcm->rx_psock before sk_rmem_alloc_get to sync with * kcm_rfree */ smp_mb(); if (unlikely(kcm->done)) { spin_unlock_bh(&mux->rx_lock); /* Need to run kcm_done in a task since we need to qcquire * callback locks which may already be held here. */ INIT_WORK(&kcm->done_work, kcm_done_work); schedule_work(&kcm->done_work); return; } if (unlikely(kcm->rx_disabled)) { requeue_rx_msgs(mux, &kcm->sk.sk_receive_queue); } else if (rcv_ready || unlikely(!sk_rmem_alloc_get(&kcm->sk))) { /* Check for degenerative race with rx_wait that all * data was dequeued (accounted for in kcm_rfree). */ kcm_rcv_ready(kcm); } spin_unlock_bh(&mux->rx_lock); } /* Lower sock lock held */ static void psock_data_ready(struct sock *sk) { struct kcm_psock *psock; trace_sk_data_ready(sk); read_lock_bh(&sk->sk_callback_lock); psock = (struct kcm_psock *)sk->sk_user_data; if (likely(psock)) strp_data_ready(&psock->strp); read_unlock_bh(&sk->sk_callback_lock); } /* Called with lower sock held */ static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb) { struct kcm_psock *psock = container_of(strp, struct kcm_psock, strp); struct kcm_sock *kcm; try_queue: kcm = reserve_rx_kcm(psock, skb); if (!kcm) { /* Unable to reserve a KCM, message is held in psock and strp * is paused. */ return; } if (kcm_queue_rcv_skb(&kcm->sk, skb)) { /* Should mean socket buffer full */ unreserve_rx_kcm(psock, false); goto try_queue; } } static int kcm_parse_func_strparser(struct strparser *strp, struct sk_buff *skb) { struct kcm_psock *psock = container_of(strp, struct kcm_psock, strp); struct bpf_prog *prog = psock->bpf_prog; int res; res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } static int kcm_read_sock_done(struct strparser *strp, int err) { struct kcm_psock *psock = container_of(strp, struct kcm_psock, strp); unreserve_rx_kcm(psock, true); return err; } static void psock_state_change(struct sock *sk) { /* TCP only does a EPOLLIN for a half close. Do a EPOLLHUP here * since application will normally not poll with EPOLLIN * on the TCP sockets. */ report_csk_error(sk, EPIPE); } static void psock_write_space(struct sock *sk) { struct kcm_psock *psock; struct kcm_mux *mux; struct kcm_sock *kcm; read_lock_bh(&sk->sk_callback_lock); psock = (struct kcm_psock *)sk->sk_user_data; if (unlikely(!psock)) goto out; mux = psock->mux; spin_lock_bh(&mux->lock); /* Check if the socket is reserved so someone is waiting for sending. */ kcm = psock->tx_kcm; if (kcm) queue_work(kcm_wq, &kcm->tx_work); spin_unlock_bh(&mux->lock); out: read_unlock_bh(&sk->sk_callback_lock); } static void unreserve_psock(struct kcm_sock *kcm); /* kcm sock is locked. */ static struct kcm_psock *reserve_psock(struct kcm_sock *kcm) { struct kcm_mux *mux = kcm->mux; struct kcm_psock *psock; psock = kcm->tx_psock; smp_rmb(); /* Must read tx_psock before tx_wait */ if (psock) { WARN_ON(kcm->tx_wait); if (unlikely(psock->tx_stopped)) unreserve_psock(kcm); else return kcm->tx_psock; } spin_lock_bh(&mux->lock); /* Check again under lock to see if psock was reserved for this * psock via psock_unreserve. */ psock = kcm->tx_psock; if (unlikely(psock)) { WARN_ON(kcm->tx_wait); spin_unlock_bh(&mux->lock); return kcm->tx_psock; } if (!list_empty(&mux->psocks_avail)) { psock = list_first_entry(&mux->psocks_avail, struct kcm_psock, psock_avail_list); list_del(&psock->psock_avail_list); if (kcm->tx_wait) { list_del(&kcm->wait_psock_list); kcm->tx_wait = false; } kcm->tx_psock = psock; psock->tx_kcm = kcm; KCM_STATS_INCR(psock->stats.reserved); } else if (!kcm->tx_wait) { list_add_tail(&kcm->wait_psock_list, &mux->kcm_tx_waiters); kcm->tx_wait = true; } spin_unlock_bh(&mux->lock); return psock; } /* mux lock held */ static void psock_now_avail(struct kcm_psock *psock) { struct kcm_mux *mux = psock->mux; struct kcm_sock *kcm; if (list_empty(&mux->kcm_tx_waiters)) { list_add_tail(&psock->psock_avail_list, &mux->psocks_avail); } else { kcm = list_first_entry(&mux->kcm_tx_waiters, struct kcm_sock, wait_psock_list); list_del(&kcm->wait_psock_list); kcm->tx_wait = false; psock->tx_kcm = kcm; /* Commit before changing tx_psock since that is read in * reserve_psock before queuing work. */ smp_mb(); kcm->tx_psock = psock; KCM_STATS_INCR(psock->stats.reserved); queue_work(kcm_wq, &kcm->tx_work); } } /* kcm sock is locked. */ static void unreserve_psock(struct kcm_sock *kcm) { struct kcm_psock *psock; struct kcm_mux *mux = kcm->mux; spin_lock_bh(&mux->lock); psock = kcm->tx_psock; if (WARN_ON(!psock)) { spin_unlock_bh(&mux->lock); return; } smp_rmb(); /* Read tx_psock before tx_wait */ kcm_update_tx_mux_stats(mux, psock); WARN_ON(kcm->tx_wait); kcm->tx_psock = NULL; psock->tx_kcm = NULL; KCM_STATS_INCR(psock->stats.unreserved); if (unlikely(psock->tx_stopped)) { if (psock->done) { /* Deferred free */ list_del(&psock->psock_list); mux->psocks_cnt--; sock_put(psock->sk); fput(psock->sk->sk_socket->file); kmem_cache_free(kcm_psockp, psock); } /* Don't put back on available list */ spin_unlock_bh(&mux->lock); return; } psock_now_avail(psock); spin_unlock_bh(&mux->lock); } static void kcm_report_tx_retry(struct kcm_sock *kcm) { struct kcm_mux *mux = kcm->mux; spin_lock_bh(&mux->lock); KCM_STATS_INCR(mux->stats.tx_retries); spin_unlock_bh(&mux->lock); } /* Write any messages ready on the kcm socket. Called with kcm sock lock * held. Return bytes actually sent or error. */ static int kcm_write_msgs(struct kcm_sock *kcm) { unsigned int total_sent = 0; struct sock *sk = &kcm->sk; struct kcm_psock *psock; struct sk_buff *head; int ret = 0; kcm->tx_wait_more = false; psock = kcm->tx_psock; if (unlikely(psock && psock->tx_stopped)) { /* A reserved psock was aborted asynchronously. Unreserve * it and we'll retry the message. */ unreserve_psock(kcm); kcm_report_tx_retry(kcm); if (skb_queue_empty(&sk->sk_write_queue)) return 0; kcm_tx_msg(skb_peek(&sk->sk_write_queue))->started_tx = false; } retry: while ((head = skb_peek(&sk->sk_write_queue))) { struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, }; struct kcm_tx_msg *txm = kcm_tx_msg(head); struct sk_buff *skb; unsigned int msize; int i; if (!txm->started_tx) { psock = reserve_psock(kcm); if (!psock) goto out; skb = head; txm->frag_offset = 0; txm->sent = 0; txm->started_tx = true; } else { if (WARN_ON(!psock)) { ret = -EINVAL; goto out; } skb = txm->frag_skb; } if (WARN_ON(!skb_shinfo(skb)->nr_frags) || WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) { ret = -EINVAL; goto out; } msize = 0; for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) msize += skb_frag_size(&skb_shinfo(skb)->frags[i]); iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, (const struct bio_vec *)skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags, msize); iov_iter_advance(&msg.msg_iter, txm->frag_offset); do { ret = sock_sendmsg(psock->sk->sk_socket, &msg); if (ret <= 0) { if (ret == -EAGAIN) { /* Save state to try again when there's * write space on the socket */ txm->frag_skb = skb; ret = 0; goto out; } /* Hard failure in sending message, abort this * psock since it has lost framing * synchronization and retry sending the * message from the beginning. */ kcm_abort_tx_psock(psock, ret ? -ret : EPIPE, true); unreserve_psock(kcm); psock = NULL; txm->started_tx = false; kcm_report_tx_retry(kcm); ret = 0; goto retry; } txm->sent += ret; txm->frag_offset += ret; KCM_STATS_ADD(psock->stats.tx_bytes, ret); } while (msg.msg_iter.count > 0); if (skb == head) { if (skb_has_frag_list(skb)) { txm->frag_skb = skb_shinfo(skb)->frag_list; txm->frag_offset = 0; continue; } } else if (skb->next) { txm->frag_skb = skb->next; txm->frag_offset = 0; continue; } /* Successfully sent the whole packet, account for it. */ sk->sk_wmem_queued -= txm->sent; total_sent += txm->sent; skb_dequeue(&sk->sk_write_queue); kfree_skb(head); KCM_STATS_INCR(psock->stats.tx_msgs); } out: if (!head) { /* Done with all queued messages. */ WARN_ON(!skb_queue_empty(&sk->sk_write_queue)); if (psock) unreserve_psock(kcm); } /* Check if write space is available */ sk->sk_write_space(sk); return total_sent ? : ret; } static void kcm_tx_work(struct work_struct *w) { struct kcm_sock *kcm = container_of(w, struct kcm_sock, tx_work); struct sock *sk = &kcm->sk; int err; lock_sock(sk); /* Primarily for SOCK_DGRAM sockets, also handle asynchronous tx * aborts */ err = kcm_write_msgs(kcm); if (err < 0) { /* Hard failure in write, report error on KCM socket */ pr_warn("KCM: Hard failure on kcm_write_msgs %d\n", err); report_csk_error(&kcm->sk, -err); goto out; } /* Primarily for SOCK_SEQPACKET sockets */ if (likely(sk->sk_socket) && test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) { clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags); sk->sk_write_space(sk); } out: release_sock(sk); } static void kcm_push(struct kcm_sock *kcm) { if (kcm->tx_wait_more) kcm_write_msgs(kcm); } static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; struct kcm_sock *kcm = kcm_sk(sk); struct sk_buff *skb = NULL, *head = NULL; size_t copy, copied = 0; long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); int eor = (sock->type == SOCK_DGRAM) ? !(msg->msg_flags & MSG_MORE) : !!(msg->msg_flags & MSG_EOR); int err = -EPIPE; mutex_lock(&kcm->tx_mutex); lock_sock(sk); /* Per tcp_sendmsg this should be in poll */ sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); if (sk->sk_err) goto out_error; if (kcm->seq_skb) { /* Previously opened message */ head = kcm->seq_skb; skb = kcm_tx_msg(head)->last_skb; goto start; } /* Call the sk_stream functions to manage the sndbuf mem. */ if (!sk_stream_memory_free(sk)) { kcm_push(kcm); set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); err = sk_stream_wait_memory(sk, &timeo); if (err) goto out_error; } if (msg_data_left(msg)) { /* New message, alloc head skb */ head = alloc_skb(0, sk->sk_allocation); while (!head) { kcm_push(kcm); err = sk_stream_wait_memory(sk, &timeo); if (err) goto out_error; head = alloc_skb(0, sk->sk_allocation); } skb = head; /* Set ip_summed to CHECKSUM_UNNECESSARY to avoid calling * csum_and_copy_from_iter from skb_do_copy_data_nocache. */ skb->ip_summed = CHECKSUM_UNNECESSARY; } start: while (msg_data_left(msg)) { bool merge = true; int i = skb_shinfo(skb)->nr_frags; struct page_frag *pfrag = sk_page_frag(sk); if (!sk_page_frag_refill(sk, pfrag)) goto wait_for_memory; if (!skb_can_coalesce(skb, i, pfrag->page, pfrag->offset)) { if (i == MAX_SKB_FRAGS) { struct sk_buff *tskb; tskb = alloc_skb(0, sk->sk_allocation); if (!tskb) goto wait_for_memory; if (head == skb) skb_shinfo(head)->frag_list = tskb; else skb->next = tskb; skb = tskb; skb->ip_summed = CHECKSUM_UNNECESSARY; continue; } merge = false; } if (msg->msg_flags & MSG_SPLICE_PAGES) { copy = msg_data_left(msg); if (!sk_wmem_schedule(sk, copy)) goto wait_for_memory; err = skb_splice_from_iter(skb, &msg->msg_iter, copy); if (err < 0) { if (err == -EMSGSIZE) goto wait_for_memory; goto out_error; } copy = err; skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG; sk_wmem_queued_add(sk, copy); sk_mem_charge(sk, copy); if (head != skb) head->truesize += copy; } else { copy = min_t(int, msg_data_left(msg), pfrag->size - pfrag->offset); if (!sk_wmem_schedule(sk, copy)) goto wait_for_memory; err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb, pfrag->page, pfrag->offset, copy); if (err) goto out_error; /* Update the skb. */ if (merge) { skb_frag_size_add( &skb_shinfo(skb)->frags[i - 1], copy); } else { skb_fill_page_desc(skb, i, pfrag->page, pfrag->offset, copy); get_page(pfrag->page); } pfrag->offset += copy; } copied += copy; if (head != skb) { head->len += copy; head->data_len += copy; } continue; wait_for_memory: kcm_push(kcm); err = sk_stream_wait_memory(sk, &timeo); if (err) goto out_error; } if (eor) { bool not_busy = skb_queue_empty(&sk->sk_write_queue); if (head) { /* Message complete, queue it on send buffer */ __skb_queue_tail(&sk->sk_write_queue, head); kcm->seq_skb = NULL; KCM_STATS_INCR(kcm->stats.tx_msgs); } if (msg->msg_flags & MSG_BATCH) { kcm->tx_wait_more = true; } else if (kcm->tx_wait_more || not_busy) { err = kcm_write_msgs(kcm); if (err < 0) { /* We got a hard error in write_msgs but have * already queued this message. Report an error * in the socket, but don't affect return value * from sendmsg */ pr_warn("KCM: Hard failure on kcm_write_msgs\n"); report_csk_error(&kcm->sk, -err); } } } else { /* Message not complete, save state */ partial_message: if (head) { kcm->seq_skb = head; kcm_tx_msg(head)->last_skb = skb; } } KCM_STATS_ADD(kcm->stats.tx_bytes, copied); release_sock(sk); mutex_unlock(&kcm->tx_mutex); return copied; out_error: kcm_push(kcm); if (sock->type == SOCK_SEQPACKET) { /* Wrote some bytes before encountering an * error, return partial success. */ if (copied) goto partial_message; if (head != kcm->seq_skb) kfree_skb(head); } else { kfree_skb(head); kcm->seq_skb = NULL; } err = sk_stream_error(sk, msg->msg_flags, err); /* make sure we wake any epoll edge trigger waiter */ if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN)) sk->sk_write_space(sk); release_sock(sk); mutex_unlock(&kcm->tx_mutex); return err; } static void kcm_splice_eof(struct socket *sock) { struct sock *sk = sock->sk; struct kcm_sock *kcm = kcm_sk(sk); if (skb_queue_empty_lockless(&sk->sk_write_queue)) return; lock_sock(sk); kcm_write_msgs(kcm); release_sock(sk); } static int kcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, int flags) { struct sock *sk = sock->sk; struct kcm_sock *kcm = kcm_sk(sk); int err = 0; struct strp_msg *stm; int copied = 0; struct sk_buff *skb; skb = skb_recv_datagram(sk, flags, &err); if (!skb) goto out; /* Okay, have a message on the receive queue */ stm = strp_msg(skb); if (len > stm->full_len) len = stm->full_len; err = skb_copy_datagram_msg(skb, stm->offset, msg, len); if (err < 0) goto out; copied = len; if (likely(!(flags & MSG_PEEK))) { KCM_STATS_ADD(kcm->stats.rx_bytes, copied); if (copied < stm->full_len) { if (sock->type == SOCK_DGRAM) { /* Truncated message */ msg->msg_flags |= MSG_TRUNC; goto msg_finished; } stm->offset += copied; stm->full_len -= copied; } else { msg_finished: /* Finished with message */ msg->msg_flags |= MSG_EOR; KCM_STATS_INCR(kcm->stats.rx_msgs); } } out: skb_free_datagram(sk, skb); return copied ? : err; } static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) { struct sock *sk = sock->sk; struct kcm_sock *kcm = kcm_sk(sk); struct strp_msg *stm; int err = 0; ssize_t copied; struct sk_buff *skb; if (sock->file->f_flags & O_NONBLOCK || flags & SPLICE_F_NONBLOCK) flags = MSG_DONTWAIT; else flags = 0; /* Only support splice for SOCKSEQPACKET */ skb = skb_recv_datagram(sk, flags, &err); if (!skb) goto err_out; /* Okay, have a message on the receive queue */ stm = strp_msg(skb); if (len > stm->full_len) len = stm->full_len; copied = skb_splice_bits(skb, sk, stm->offset, pipe, len, flags); if (copied < 0) { err = copied; goto err_out; } KCM_STATS_ADD(kcm->stats.rx_bytes, copied); stm->offset += copied; stm->full_len -= copied; /* We have no way to return MSG_EOR. If all the bytes have been * read we still leave the message in the receive socket buffer. * A subsequent recvmsg needs to be done to return MSG_EOR and * finish reading the message. */ skb_free_datagram(sk, skb); return copied; err_out: skb_free_datagram(sk, skb); return err; } /* kcm sock lock held */ static void kcm_recv_disable(struct kcm_sock *kcm) { struct kcm_mux *mux = kcm->mux; if (kcm->rx_disabled) return; spin_lock_bh(&mux->rx_lock); kcm->rx_disabled = 1; /* If a psock is reserved we'll do cleanup in unreserve */ if (!kcm->rx_psock) { if (kcm->rx_wait) { list_del(&kcm->wait_rx_list); /* paired with lockless reads in kcm_rfree() */ WRITE_ONCE(kcm->rx_wait, false); } requeue_rx_msgs(mux, &kcm->sk.sk_receive_queue); } spin_unlock_bh(&mux->rx_lock); } /* kcm sock lock held */ static void kcm_recv_enable(struct kcm_sock *kcm) { struct kcm_mux *mux = kcm->mux; if (!kcm->rx_disabled) return; spin_lock_bh(&mux->rx_lock); kcm->rx_disabled = 0; kcm_rcv_ready(kcm); spin_unlock_bh(&mux->rx_lock); } static int kcm_setsockopt(struct socket *sock, int level, int optname, sockptr_t optval, unsigned int optlen) { struct kcm_sock *kcm = kcm_sk(sock->sk); int val, valbool; int err = 0; if (level != SOL_KCM) return -ENOPROTOOPT; if (optlen < sizeof(int)) return -EINVAL; if (copy_from_sockptr(&val, optval, sizeof(int))) return -EFAULT; valbool = val ? 1 : 0; switch (optname) { case KCM_RECV_DISABLE: lock_sock(&kcm->sk); if (valbool) kcm_recv_disable(kcm); else kcm_recv_enable(kcm); release_sock(&kcm->sk); break; default: err = -ENOPROTOOPT; } return err; } static int kcm_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen) { struct kcm_sock *kcm = kcm_sk(sock->sk); int val, len; if (level != SOL_KCM) return -ENOPROTOOPT; if (get_user(len, optlen)) return -EFAULT; if (len < 0) return -EINVAL; len = min_t(unsigned int, len, sizeof(int)); switch (optname) { case KCM_RECV_DISABLE: val = kcm->rx_disabled; break; default: return -ENOPROTOOPT; } if (put_user(len, optlen)) return -EFAULT; if (copy_to_user(optval, &val, len)) return -EFAULT; return 0; } static void init_kcm_sock(struct kcm_sock *kcm, struct kcm_mux *mux) { struct kcm_sock *tkcm; struct list_head *head; int index = 0; /* For SOCK_SEQPACKET sock type, datagram_poll checks the sk_state, so * we set sk_state, otherwise epoll_wait always returns right away with * EPOLLHUP */ kcm->sk.sk_state = TCP_ESTABLISHED; /* Add to mux's kcm sockets list */ kcm->mux = mux; spin_lock_bh(&mux->lock); head = &mux->kcm_socks; list_for_each_entry(tkcm, &mux->kcm_socks, kcm_sock_list) { if (tkcm->index != index) break; head = &tkcm->kcm_sock_list; index++; } list_add(&kcm->kcm_sock_list, head); kcm->index = index; mux->kcm_socks_cnt++; spin_unlock_bh(&mux->lock); INIT_WORK(&kcm->tx_work, kcm_tx_work); mutex_init(&kcm->tx_mutex); spin_lock_bh(&mux->rx_lock); kcm_rcv_ready(kcm); spin_unlock_bh(&mux->rx_lock); } static int kcm_attach(struct socket *sock, struct socket *csock, struct bpf_prog *prog) { struct kcm_sock *kcm = kcm_sk(sock->sk); struct kcm_mux *mux = kcm->mux; struct sock *csk; struct kcm_psock *psock = NULL, *tpsock; struct list_head *head; int index = 0; static const struct strp_callbacks cb = { .rcv_msg = kcm_rcv_strparser, .parse_msg = kcm_parse_func_strparser, .read_sock_done = kcm_read_sock_done, }; int err = 0; csk = csock->sk; if (!csk) return -EINVAL; lock_sock(csk); /* Only allow TCP sockets to be attached for now */ if ((csk->sk_family != AF_INET && csk->sk_family != AF_INET6) || csk->sk_protocol != IPPROTO_TCP) { err = -EOPNOTSUPP; goto out; } /* Don't allow listeners or closed sockets */ if (csk->sk_state == TCP_LISTEN || csk->sk_state == TCP_CLOSE) { err = -EOPNOTSUPP; goto out; } psock = kmem_cache_zalloc(kcm_psockp, GFP_KERNEL); if (!psock) { err = -ENOMEM; goto out; } psock->mux = mux; psock->sk = csk; psock->bpf_prog = prog; write_lock_bh(&csk->sk_callback_lock); /* Check if sk_user_data is already by KCM or someone else. * Must be done under lock to prevent race conditions. */ if (csk->sk_user_data) { write_unlock_bh(&csk->sk_callback_lock); kmem_cache_free(kcm_psockp, psock); err = -EALREADY; goto out; } err = strp_init(&psock->strp, csk, &cb); if (err) { write_unlock_bh(&csk->sk_callback_lock); kmem_cache_free(kcm_psockp, psock); goto out; } psock->save_data_ready = csk->sk_data_ready; psock->save_write_space = csk->sk_write_space; psock->save_state_change = csk->sk_state_change; csk->sk_user_data = psock; csk->sk_data_ready = psock_data_ready; csk->sk_write_space = psock_write_space; csk->sk_state_change = psock_state_change; write_unlock_bh(&csk->sk_callback_lock); sock_hold(csk); /* Finished initialization, now add the psock to the MUX. */ spin_lock_bh(&mux->lock); head = &mux->psocks; list_for_each_entry(tpsock, &mux->psocks, psock_list) { if (tpsock->index != index) break; head = &tpsock->psock_list; index++; } list_add(&psock->psock_list, head); psock->index = index; KCM_STATS_INCR(mux->stats.psock_attach); mux->psocks_cnt++; psock_now_avail(psock); spin_unlock_bh(&mux->lock); /* Schedule RX work in case there are already bytes queued */ strp_check_rcv(&psock->strp); out: release_sock(csk); return err; } static int kcm_attach_ioctl(struct socket *sock, struct kcm_attach *info) { struct socket *csock; struct bpf_prog *prog; int err; csock = sockfd_lookup(info->fd, &err); if (!csock) return -ENOENT; prog = bpf_prog_get_type(info->bpf_fd, BPF_PROG_TYPE_SOCKET_FILTER); if (IS_ERR(prog)) { err = PTR_ERR(prog); goto out; } err = kcm_attach(sock, csock, prog); if (err) { bpf_prog_put(prog); goto out; } /* Keep reference on file also */ return 0; out: sockfd_put(csock); return err; } static void kcm_unattach(struct kcm_psock *psock) { struct sock *csk = psock->sk; struct kcm_mux *mux = psock->mux; lock_sock(csk); /* Stop getting callbacks from TCP socket. After this there should * be no way to reserve a kcm for this psock. */ write_lock_bh(&csk->sk_callback_lock); csk->sk_user_data = NULL; csk->sk_data_ready = psock->save_data_ready; csk->sk_write_space = psock->save_write_space; csk->sk_state_change = psock->save_state_change; strp_stop(&psock->strp); if (WARN_ON(psock->rx_kcm)) { write_unlock_bh(&csk->sk_callback_lock); release_sock(csk); return; } spin_lock_bh(&mux->rx_lock); /* Stop receiver activities. After this point psock should not be * able to get onto ready list either through callbacks or work. */ if (psock->ready_rx_msg) { list_del(&psock->psock_ready_list); kfree_skb(psock->ready_rx_msg); psock->ready_rx_msg = NULL; KCM_STATS_INCR(mux->stats.rx_ready_drops); } spin_unlock_bh(&mux->rx_lock); write_unlock_bh(&csk->sk_callback_lock); /* Call strp_done without sock lock */ release_sock(csk); strp_done(&psock->strp); lock_sock(csk); bpf_prog_put(psock->bpf_prog); spin_lock_bh(&mux->lock); aggregate_psock_stats(&psock->stats, &mux->aggregate_psock_stats); save_strp_stats(&psock->strp, &mux->aggregate_strp_stats); KCM_STATS_INCR(mux->stats.psock_unattach); if (psock->tx_kcm) { /* psock was reserved. Just mark it finished and we will clean * up in the kcm paths, we need kcm lock which can not be * acquired here. */ KCM_STATS_INCR(mux->stats.psock_unattach_rsvd); spin_unlock_bh(&mux->lock); /* We are unattaching a socket that is reserved. Abort the * socket since we may be out of sync in sending on it. We need * to do this without the mux lock. */ kcm_abort_tx_psock(psock, EPIPE, false); spin_lock_bh(&mux->lock); if (!psock->tx_kcm) { /* psock now unreserved in window mux was unlocked */ goto no_reserved; } psock->done = 1; /* Commit done before queuing work to process it */ smp_mb(); /* Queue tx work to make sure psock->done is handled */ queue_work(kcm_wq, &psock->tx_kcm->tx_work); spin_unlock_bh(&mux->lock); } else { no_reserved: if (!psock->tx_stopped) list_del(&psock->psock_avail_list); list_del(&psock->psock_list); mux->psocks_cnt--; spin_unlock_bh(&mux->lock); sock_put(csk); fput(csk->sk_socket->file); kmem_cache_free(kcm_psockp, psock); } release_sock(csk); } static int kcm_unattach_ioctl(struct socket *sock, struct kcm_unattach *info) { struct kcm_sock *kcm = kcm_sk(sock->sk); struct kcm_mux *mux = kcm->mux; struct kcm_psock *psock; struct socket *csock; struct sock *csk; int err; csock = sockfd_lookup(info->fd, &err); if (!csock) return -ENOENT; csk = csock->sk; if (!csk) { err = -EINVAL; goto out; } err = -ENOENT; spin_lock_bh(&mux->lock); list_for_each_entry(psock, &mux->psocks, psock_list) { if (psock->sk != csk) continue; /* Found the matching psock */ if (psock->unattaching || WARN_ON(psock->done)) { err = -EALREADY; break; } psock->unattaching = 1; spin_unlock_bh(&mux->lock); /* Lower socket lock should already be held */ kcm_unattach(psock); err = 0; goto out; } spin_unlock_bh(&mux->lock); out: sockfd_put(csock); return err; } static struct proto kcm_proto = { .name = "KCM", .owner = THIS_MODULE, .obj_size = sizeof(struct kcm_sock), }; /* Clone a kcm socket. */ static struct file *kcm_clone(struct socket *osock) { struct socket *newsock; struct sock *newsk; newsock = sock_alloc(); if (!newsock) return ERR_PTR(-ENFILE); newsock->type = osock->type; newsock->ops = osock->ops; __module_get(newsock->ops->owner); newsk = sk_alloc(sock_net(osock->sk), PF_KCM, GFP_KERNEL, &kcm_proto, false); if (!newsk) { sock_release(newsock); return ERR_PTR(-ENOMEM); } sock_init_data(newsock, newsk); init_kcm_sock(kcm_sk(newsk), kcm_sk(osock->sk)->mux); return sock_alloc_file(newsock, 0, osock->sk->sk_prot_creator->name); } static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { int err; switch (cmd) { case SIOCKCMATTACH: { struct kcm_attach info; if (copy_from_user(&info, (void __user *)arg, sizeof(info))) return -EFAULT; err = kcm_attach_ioctl(sock, &info); break; } case SIOCKCMUNATTACH: { struct kcm_unattach info; if (copy_from_user(&info, (void __user *)arg, sizeof(info))) return -EFAULT; err = kcm_unattach_ioctl(sock, &info); break; } case SIOCKCMCLONE: { struct kcm_clone info; struct file *file; info.fd = get_unused_fd_flags(0); if (unlikely(info.fd < 0)) return info.fd; file = kcm_clone(sock); if (IS_ERR(file)) { put_unused_fd(info.fd); return PTR_ERR(file); } if (copy_to_user((void __user *)arg, &info, sizeof(info))) { put_unused_fd(info.fd); fput(file); return -EFAULT; } fd_install(info.fd, file); err = 0; break; } default: err = -ENOIOCTLCMD; break; } return err; } static void release_mux(struct kcm_mux *mux) { struct kcm_net *knet = mux->knet; struct kcm_psock *psock, *tmp_psock; /* Release psocks */ list_for_each_entry_safe(psock, tmp_psock, &mux->psocks, psock_list) { if (!WARN_ON(psock->unattaching)) kcm_unattach(psock); } if (WARN_ON(mux->psocks_cnt)) return; __skb_queue_purge(&mux->rx_hold_queue); mutex_lock(&knet->mutex); aggregate_mux_stats(&mux->stats, &knet->aggregate_mux_stats); aggregate_psock_stats(&mux->aggregate_psock_stats, &knet->aggregate_psock_stats); aggregate_strp_stats(&mux->aggregate_strp_stats, &knet->aggregate_strp_stats); list_del_rcu(&mux->kcm_mux_list); knet->count--; mutex_unlock(&knet->mutex); kfree_rcu(mux, rcu); } static void kcm_done(struct kcm_sock *kcm) { struct kcm_mux *mux = kcm->mux; struct sock *sk = &kcm->sk; int socks_cnt; spin_lock_bh(&mux->rx_lock); if (kcm->rx_psock) { /* Cleanup in unreserve_rx_kcm */ WARN_ON(kcm->done); kcm->rx_disabled = 1; kcm->done = 1; spin_unlock_bh(&mux->rx_lock); return; } if (kcm->rx_wait) { list_del(&kcm->wait_rx_list); /* paired with lockless reads in kcm_rfree() */ WRITE_ONCE(kcm->rx_wait, false); } /* Move any pending receive messages to other kcm sockets */ requeue_rx_msgs(mux, &sk->sk_receive_queue); spin_unlock_bh(&mux->rx_lock); if (WARN_ON(sk_rmem_alloc_get(sk))) return; /* Detach from MUX */ spin_lock_bh(&mux->lock); list_del(&kcm->kcm_sock_list); mux->kcm_socks_cnt--; socks_cnt = mux->kcm_socks_cnt; spin_unlock_bh(&mux->lock); if (!socks_cnt) { /* We are done with the mux now. */ release_mux(mux); } WARN_ON(kcm->rx_wait); sock_put(&kcm->sk); } /* Called by kcm_release to close a KCM socket. * If this is the last KCM socket on the MUX, destroy the MUX. */ static int kcm_release(struct socket *sock) { struct sock *sk = sock->sk; struct kcm_sock *kcm; struct kcm_mux *mux; struct kcm_psock *psock; if (!sk) return 0; kcm = kcm_sk(sk); mux = kcm->mux; lock_sock(sk); sock_orphan(sk); kfree_skb(kcm->seq_skb); /* Purge queue under lock to avoid race condition with tx_work trying * to act when queue is nonempty. If tx_work runs after this point * it will just return. */ __skb_queue_purge(&sk->sk_write_queue); release_sock(sk); spin_lock_bh(&mux->lock); if (kcm->tx_wait) { /* Take of tx_wait list, after this point there should be no way * that a psock will be assigned to this kcm. */ list_del(&kcm->wait_psock_list); kcm->tx_wait = false; } spin_unlock_bh(&mux->lock); /* Cancel work. After this point there should be no outside references * to the kcm socket. */ disable_work_sync(&kcm->tx_work); lock_sock(sk); psock = kcm->tx_psock; if (psock) { /* A psock was reserved, so we need to kill it since it * may already have some bytes queued from a message. We * need to do this after removing kcm from tx_wait list. */ kcm_abort_tx_psock(psock, EPIPE, false); unreserve_psock(kcm); } release_sock(sk); WARN_ON(kcm->tx_wait); WARN_ON(kcm->tx_psock); sock->sk = NULL; kcm_done(kcm); return 0; } static const struct proto_ops kcm_dgram_ops = { .family = PF_KCM, .owner = THIS_MODULE, .release = kcm_release, .bind = sock_no_bind, .connect = sock_no_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = sock_no_getname, .poll = datagram_poll, .ioctl = kcm_ioctl, .listen = sock_no_listen, .shutdown = sock_no_shutdown, .setsockopt = kcm_setsockopt, .getsockopt = kcm_getsockopt, .sendmsg = kcm_sendmsg, .recvmsg = kcm_recvmsg, .mmap = sock_no_mmap, .splice_eof = kcm_splice_eof, }; static const struct proto_ops kcm_seqpacket_ops = { .family = PF_KCM, .owner = THIS_MODULE, .release = kcm_release, .bind = sock_no_bind, .connect = sock_no_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .getname = sock_no_getname, .poll = datagram_poll, .ioctl = kcm_ioctl, .listen = sock_no_listen, .shutdown = sock_no_shutdown, .setsockopt = kcm_setsockopt, .getsockopt = kcm_getsockopt, .sendmsg = kcm_sendmsg, .recvmsg = kcm_recvmsg, .mmap = sock_no_mmap, .splice_eof = kcm_splice_eof, .splice_read = kcm_splice_read, }; /* Create proto operation for kcm sockets */ static int kcm_create(struct net *net, struct socket *sock, int protocol, int kern) { struct kcm_net *knet = net_generic(net, kcm_net_id); struct sock *sk; struct kcm_mux *mux; switch (sock->type) { case SOCK_DGRAM: sock->ops = &kcm_dgram_ops; break; case SOCK_SEQPACKET: sock->ops = &kcm_seqpacket_ops; break; default: return -ESOCKTNOSUPPORT; } if (protocol != KCMPROTO_CONNECTED) return -EPROTONOSUPPORT; sk = sk_alloc(net, PF_KCM, GFP_KERNEL, &kcm_proto, kern); if (!sk) return -ENOMEM; /* Allocate a kcm mux, shared between KCM sockets */ mux = kmem_cache_zalloc(kcm_muxp, GFP_KERNEL); if (!mux) { sk_free(sk); return -ENOMEM; } spin_lock_init(&mux->lock); spin_lock_init(&mux->rx_lock); INIT_LIST_HEAD(&mux->kcm_socks); INIT_LIST_HEAD(&mux->kcm_rx_waiters); INIT_LIST_HEAD(&mux->kcm_tx_waiters); INIT_LIST_HEAD(&mux->psocks); INIT_LIST_HEAD(&mux->psocks_ready); INIT_LIST_HEAD(&mux->psocks_avail); mux->knet = knet; /* Add new MUX to list */ mutex_lock(&knet->mutex); list_add_rcu(&mux->kcm_mux_list, &knet->mux_list); knet->count++; mutex_unlock(&knet->mutex); skb_queue_head_init(&mux->rx_hold_queue); /* Init KCM socket */ sock_init_data(sock, sk); init_kcm_sock(kcm_sk(sk), mux); return 0; } static const struct net_proto_family kcm_family_ops = { .family = PF_KCM, .create = kcm_create, .owner = THIS_MODULE, }; static __net_init int kcm_init_net(struct net *net) { struct kcm_net *knet = net_generic(net, kcm_net_id); INIT_LIST_HEAD_RCU(&knet->mux_list); mutex_init(&knet->mutex); return 0; } static __net_exit void kcm_exit_net(struct net *net) { struct kcm_net *knet = net_generic(net, kcm_net_id); /* All KCM sockets should be closed at this point, which should mean * that all multiplexors and psocks have been destroyed. */ WARN_ON(!list_empty(&knet->mux_list)); mutex_destroy(&knet->mutex); } static struct pernet_operations kcm_net_ops = { .init = kcm_init_net, .exit = kcm_exit_net, .id = &kcm_net_id, .size = sizeof(struct kcm_net), }; static int __init kcm_init(void) { int err = -ENOMEM; kcm_muxp = KMEM_CACHE(kcm_mux, SLAB_HWCACHE_ALIGN); if (!kcm_muxp) goto fail; kcm_psockp = KMEM_CACHE(kcm_psock, SLAB_HWCACHE_ALIGN); if (!kcm_psockp) goto fail; kcm_wq = create_singlethread_workqueue("kkcmd"); if (!kcm_wq) goto fail; err = proto_register(&kcm_proto, 1); if (err) goto fail; err = register_pernet_device(&kcm_net_ops); if (err) goto net_ops_fail; err = sock_register(&kcm_family_ops); if (err) goto sock_register_fail; err = kcm_proc_init(); if (err) goto proc_init_fail; return 0; proc_init_fail: sock_unregister(PF_KCM); sock_register_fail: unregister_pernet_device(&kcm_net_ops); net_ops_fail: proto_unregister(&kcm_proto); fail: kmem_cache_destroy(kcm_muxp); kmem_cache_destroy(kcm_psockp); if (kcm_wq) destroy_workqueue(kcm_wq); return err; } static void __exit kcm_exit(void) { kcm_proc_exit(); sock_unregister(PF_KCM); unregister_pernet_device(&kcm_net_ops); proto_unregister(&kcm_proto); destroy_workqueue(kcm_wq); kmem_cache_destroy(kcm_muxp); kmem_cache_destroy(kcm_psockp); } module_init(kcm_init); module_exit(kcm_exit); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("KCM (Kernel Connection Multiplexor) sockets"); MODULE_ALIAS_NETPROTO(PF_KCM); |
| 69 69 69 39 39 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 | /* * Copyright (c) 2014 Intel Corporation. All rights reserved. * Copyright (c) 2014 Chelsio, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. */ #include "iwpm_util.h" static const char iwpm_ulib_name[IWPM_ULIBNAME_SIZE] = "iWarpPortMapperUser"; u16 iwpm_ulib_version = IWPM_UABI_VERSION_MIN; static int iwpm_user_pid = IWPM_PID_UNDEFINED; static atomic_t echo_nlmsg_seq; /** * iwpm_valid_pid - Check if the userspace iwarp port mapper pid is valid * * Returns true if the pid is greater than zero, otherwise returns false */ int iwpm_valid_pid(void) { return iwpm_user_pid > 0; } /** * iwpm_register_pid - Send a netlink query to userspace * to get the iwarp port mapper pid * @pm_msg: Contains driver info to send to the userspace port mapper * @nl_client: The index of the netlink client * * nlmsg attributes: * [IWPM_NLA_REG_PID_SEQ] * [IWPM_NLA_REG_IF_NAME] * [IWPM_NLA_REG_IBDEV_NAME] * [IWPM_NLA_REG_ULIB_NAME] */ int iwpm_register_pid(struct iwpm_dev_data *pm_msg, u8 nl_client) { struct sk_buff *skb = NULL; struct iwpm_nlmsg_request *nlmsg_request = NULL; struct nlmsghdr *nlh; u32 msg_seq; const char *err_str = ""; int ret = -EINVAL; if (iwpm_check_registration(nl_client, IWPM_REG_VALID) || iwpm_user_pid == IWPM_PID_UNAVAILABLE) return 0; skb = iwpm_create_nlmsg(RDMA_NL_IWPM_REG_PID, &nlh, nl_client); if (!skb) { err_str = "Unable to create a nlmsg"; goto pid_query_error; } nlh->nlmsg_seq = iwpm_get_nlmsg_seq(); nlmsg_request = iwpm_get_nlmsg_request(nlh->nlmsg_seq, nl_client, GFP_KERNEL); if (!nlmsg_request) { err_str = "Unable to allocate netlink request"; goto pid_query_error; } msg_seq = atomic_read(&echo_nlmsg_seq); /* fill in the pid request message */ err_str = "Unable to put attribute of the nlmsg"; ret = ibnl_put_attr(skb, nlh, sizeof(u32), &msg_seq, IWPM_NLA_REG_PID_SEQ); if (ret) goto pid_query_error; ret = ibnl_put_attr(skb, nlh, IFNAMSIZ, pm_msg->if_name, IWPM_NLA_REG_IF_NAME); if (ret) goto pid_query_error; ret = ibnl_put_attr(skb, nlh, IWPM_DEVNAME_SIZE, pm_msg->dev_name, IWPM_NLA_REG_IBDEV_NAME); if (ret) goto pid_query_error; ret = ibnl_put_attr(skb, nlh, IWPM_ULIBNAME_SIZE, (char *)iwpm_ulib_name, IWPM_NLA_REG_ULIB_NAME); if (ret) goto pid_query_error; nlmsg_end(skb, nlh); pr_debug("%s: Multicasting a nlmsg (dev = %s ifname = %s iwpm = %s)\n", __func__, pm_msg->dev_name, pm_msg->if_name, iwpm_ulib_name); ret = rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_IWPM, GFP_KERNEL); if (ret) { skb = NULL; /* skb is freed in the netlink send-op handling */ iwpm_user_pid = IWPM_PID_UNAVAILABLE; err_str = "Unable to send a nlmsg"; goto pid_query_error; } nlmsg_request->req_buffer = pm_msg; ret = iwpm_wait_complete_req(nlmsg_request); return ret; pid_query_error: pr_info("%s: %s (client = %u)\n", __func__, err_str, nl_client); dev_kfree_skb(skb); if (nlmsg_request) iwpm_free_nlmsg_request(&nlmsg_request->kref); return ret; } /** * iwpm_add_mapping - Send a netlink add mapping request to * the userspace port mapper * @pm_msg: Contains the local ip/tcp address info to send * @nl_client: The index of the netlink client * * nlmsg attributes: * [IWPM_NLA_MANAGE_MAPPING_SEQ] * [IWPM_NLA_MANAGE_ADDR] * [IWPM_NLA_MANAGE_FLAGS] * * If the request is successful, the pm_msg stores * the port mapper response (mapped address info) */ int iwpm_add_mapping(struct iwpm_sa_data *pm_msg, u8 nl_client) { struct sk_buff *skb = NULL; struct iwpm_nlmsg_request *nlmsg_request = NULL; struct nlmsghdr *nlh; u32 msg_seq; const char *err_str = ""; int ret = -EINVAL; if (!iwpm_valid_pid()) return 0; if (!iwpm_check_registration(nl_client, IWPM_REG_VALID)) { err_str = "Unregistered port mapper client"; goto add_mapping_error; } skb = iwpm_create_nlmsg(RDMA_NL_IWPM_ADD_MAPPING, &nlh, nl_client); if (!skb) { err_str = "Unable to create a nlmsg"; goto add_mapping_error; } nlh->nlmsg_seq = iwpm_get_nlmsg_seq(); nlmsg_request = iwpm_get_nlmsg_request(nlh->nlmsg_seq, nl_client, GFP_KERNEL); if (!nlmsg_request) { err_str = "Unable to allocate netlink request"; goto add_mapping_error; } msg_seq = atomic_read(&echo_nlmsg_seq); /* fill in the add mapping message */ err_str = "Unable to put attribute of the nlmsg"; ret = ibnl_put_attr(skb, nlh, sizeof(u32), &msg_seq, IWPM_NLA_MANAGE_MAPPING_SEQ); if (ret) goto add_mapping_error; ret = ibnl_put_attr(skb, nlh, sizeof(struct sockaddr_storage), &pm_msg->loc_addr, IWPM_NLA_MANAGE_ADDR); if (ret) goto add_mapping_error; /* If flags are required and we're not V4, then return a quiet error */ if (pm_msg->flags && iwpm_ulib_version == IWPM_UABI_VERSION_MIN) { ret = -EINVAL; goto add_mapping_error_nowarn; } if (iwpm_ulib_version > IWPM_UABI_VERSION_MIN) { ret = ibnl_put_attr(skb, nlh, sizeof(u32), &pm_msg->flags, IWPM_NLA_MANAGE_FLAGS); if (ret) goto add_mapping_error; } nlmsg_end(skb, nlh); nlmsg_request->req_buffer = pm_msg; ret = rdma_nl_unicast_wait(&init_net, skb, iwpm_user_pid); if (ret) { skb = NULL; /* skb is freed in the netlink send-op handling */ iwpm_user_pid = IWPM_PID_UNDEFINED; err_str = "Unable to send a nlmsg"; goto add_mapping_error; } ret = iwpm_wait_complete_req(nlmsg_request); return ret; add_mapping_error: pr_info("%s: %s (client = %u)\n", __func__, err_str, nl_client); add_mapping_error_nowarn: dev_kfree_skb(skb); if (nlmsg_request) iwpm_free_nlmsg_request(&nlmsg_request->kref); return ret; } /** * iwpm_add_and_query_mapping - Process the port mapper response to * iwpm_add_and_query_mapping request * @pm_msg: Contains the local ip/tcp address info to send * @nl_client: The index of the netlink client * * nlmsg attributes: * [IWPM_NLA_QUERY_MAPPING_SEQ] * [IWPM_NLA_QUERY_LOCAL_ADDR] * [IWPM_NLA_QUERY_REMOTE_ADDR] * [IWPM_NLA_QUERY_FLAGS] */ int iwpm_add_and_query_mapping(struct iwpm_sa_data *pm_msg, u8 nl_client) { struct sk_buff *skb = NULL; struct iwpm_nlmsg_request *nlmsg_request = NULL; struct nlmsghdr *nlh; u32 msg_seq; const char *err_str = ""; int ret = -EINVAL; if (!iwpm_valid_pid()) return 0; if (!iwpm_check_registration(nl_client, IWPM_REG_VALID)) { err_str = "Unregistered port mapper client"; goto query_mapping_error; } ret = -ENOMEM; skb = iwpm_create_nlmsg(RDMA_NL_IWPM_QUERY_MAPPING, &nlh, nl_client); if (!skb) { err_str = "Unable to create a nlmsg"; goto query_mapping_error; } nlh->nlmsg_seq = iwpm_get_nlmsg_seq(); nlmsg_request = iwpm_get_nlmsg_request(nlh->nlmsg_seq, nl_client, GFP_KERNEL); if (!nlmsg_request) { err_str = "Unable to allocate netlink request"; goto query_mapping_error; } msg_seq = atomic_read(&echo_nlmsg_seq); /* fill in the query message */ err_str = "Unable to put attribute of the nlmsg"; ret = ibnl_put_attr(skb, nlh, sizeof(u32), &msg_seq, IWPM_NLA_QUERY_MAPPING_SEQ); if (ret) goto query_mapping_error; ret = ibnl_put_attr(skb, nlh, sizeof(struct sockaddr_storage), &pm_msg->loc_addr, IWPM_NLA_QUERY_LOCAL_ADDR); if (ret) goto query_mapping_error; ret = ibnl_put_attr(skb, nlh, sizeof(struct sockaddr_storage), &pm_msg->rem_addr, IWPM_NLA_QUERY_REMOTE_ADDR); if (ret) goto query_mapping_error; /* If flags are required and we're not V4, then return a quite error */ if (pm_msg->flags && iwpm_ulib_version == IWPM_UABI_VERSION_MIN) { ret = -EINVAL; goto query_mapping_error_nowarn; } if (iwpm_ulib_version > IWPM_UABI_VERSION_MIN) { ret = ibnl_put_attr(skb, nlh, sizeof(u32), &pm_msg->flags, IWPM_NLA_QUERY_FLAGS); if (ret) goto query_mapping_error; } nlmsg_end(skb, nlh); nlmsg_request->req_buffer = pm_msg; ret = rdma_nl_unicast_wait(&init_net, skb, iwpm_user_pid); if (ret) { skb = NULL; /* skb is freed in the netlink send-op handling */ err_str = "Unable to send a nlmsg"; goto query_mapping_error; } ret = iwpm_wait_complete_req(nlmsg_request); return ret; query_mapping_error: pr_info("%s: %s (client = %u)\n", __func__, err_str, nl_client); query_mapping_error_nowarn: dev_kfree_skb(skb); if (nlmsg_request) iwpm_free_nlmsg_request(&nlmsg_request->kref); return ret; } /** * iwpm_remove_mapping - Send a netlink remove mapping request * to the userspace port mapper * * @local_addr: Local ip/tcp address to remove * @nl_client: The index of the netlink client * * nlmsg attributes: * [IWPM_NLA_MANAGE_MAPPING_SEQ] * [IWPM_NLA_MANAGE_ADDR] */ int iwpm_remove_mapping(struct sockaddr_storage *local_addr, u8 nl_client) { struct sk_buff *skb = NULL; struct nlmsghdr *nlh; u32 msg_seq; const char *err_str = ""; int ret = -EINVAL; if (!iwpm_valid_pid()) return 0; if (iwpm_check_registration(nl_client, IWPM_REG_UNDEF)) { err_str = "Unregistered port mapper client"; goto remove_mapping_error; } skb = iwpm_create_nlmsg(RDMA_NL_IWPM_REMOVE_MAPPING, &nlh, nl_client); if (!skb) { ret = -ENOMEM; err_str = "Unable to create a nlmsg"; goto remove_mapping_error; } msg_seq = atomic_read(&echo_nlmsg_seq); nlh->nlmsg_seq = iwpm_get_nlmsg_seq(); err_str = "Unable to put attribute of the nlmsg"; ret = ibnl_put_attr(skb, nlh, sizeof(u32), &msg_seq, IWPM_NLA_MANAGE_MAPPING_SEQ); if (ret) goto remove_mapping_error; ret = ibnl_put_attr(skb, nlh, sizeof(struct sockaddr_storage), local_addr, IWPM_NLA_MANAGE_ADDR); if (ret) goto remove_mapping_error; nlmsg_end(skb, nlh); ret = rdma_nl_unicast_wait(&init_net, skb, iwpm_user_pid); if (ret) { skb = NULL; /* skb is freed in the netlink send-op handling */ iwpm_user_pid = IWPM_PID_UNDEFINED; err_str = "Unable to send a nlmsg"; goto remove_mapping_error; } iwpm_print_sockaddr(local_addr, "remove_mapping: Local sockaddr:"); return 0; remove_mapping_error: pr_info("%s: %s (client = %u)\n", __func__, err_str, nl_client); if (skb) dev_kfree_skb_any(skb); return ret; } /* netlink attribute policy for the received response to register pid request */ static const struct nla_policy resp_reg_policy[IWPM_NLA_RREG_PID_MAX] = { [IWPM_NLA_RREG_PID_SEQ] = { .type = NLA_U32 }, [IWPM_NLA_RREG_IBDEV_NAME] = { .type = NLA_STRING, .len = IWPM_DEVNAME_SIZE - 1 }, [IWPM_NLA_RREG_ULIB_NAME] = { .type = NLA_STRING, .len = IWPM_ULIBNAME_SIZE - 1 }, [IWPM_NLA_RREG_ULIB_VER] = { .type = NLA_U16 }, [IWPM_NLA_RREG_PID_ERR] = { .type = NLA_U16 } }; /** * iwpm_register_pid_cb - Process the port mapper response to * iwpm_register_pid query * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) * * If successful, the function receives the userspace port mapper pid * which is used in future communication with the port mapper */ int iwpm_register_pid_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct iwpm_nlmsg_request *nlmsg_request = NULL; struct nlattr *nltb[IWPM_NLA_RREG_PID_MAX]; struct iwpm_dev_data *pm_msg; char *dev_name, *iwpm_name; u32 msg_seq; u8 nl_client; u16 iwpm_version; const char *msg_type = "Register Pid response"; if (iwpm_parse_nlmsg(cb, IWPM_NLA_RREG_PID_MAX, resp_reg_policy, nltb, msg_type)) return -EINVAL; msg_seq = nla_get_u32(nltb[IWPM_NLA_RREG_PID_SEQ]); nlmsg_request = iwpm_find_nlmsg_request(msg_seq); if (!nlmsg_request) { pr_info("%s: Could not find a matching request (seq = %u)\n", __func__, msg_seq); return -EINVAL; } pm_msg = nlmsg_request->req_buffer; nl_client = nlmsg_request->nl_client; dev_name = (char *)nla_data(nltb[IWPM_NLA_RREG_IBDEV_NAME]); iwpm_name = (char *)nla_data(nltb[IWPM_NLA_RREG_ULIB_NAME]); iwpm_version = nla_get_u16(nltb[IWPM_NLA_RREG_ULIB_VER]); /* check device name, ulib name and version */ if (strcmp(pm_msg->dev_name, dev_name) || strcmp(iwpm_ulib_name, iwpm_name) || iwpm_version < IWPM_UABI_VERSION_MIN) { pr_info("%s: Incorrect info (dev = %s name = %s version = %u)\n", __func__, dev_name, iwpm_name, iwpm_version); nlmsg_request->err_code = IWPM_USER_LIB_INFO_ERR; goto register_pid_response_exit; } iwpm_user_pid = cb->nlh->nlmsg_pid; iwpm_ulib_version = iwpm_version; if (iwpm_ulib_version < IWPM_UABI_VERSION) pr_warn_once("%s: Down level iwpmd/pid %d. Continuing...", __func__, iwpm_user_pid); atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); pr_debug("%s: iWarp Port Mapper (pid = %d) is available!\n", __func__, iwpm_user_pid); iwpm_set_registration(nl_client, IWPM_REG_VALID); register_pid_response_exit: nlmsg_request->request_done = 1; /* always for found nlmsg_request */ kref_put(&nlmsg_request->kref, iwpm_free_nlmsg_request); barrier(); up(&nlmsg_request->sem); return 0; } /* netlink attribute policy for the received response to add mapping request */ static const struct nla_policy resp_add_policy[IWPM_NLA_RMANAGE_MAPPING_MAX] = { [IWPM_NLA_RMANAGE_MAPPING_SEQ] = { .type = NLA_U32 }, [IWPM_NLA_RMANAGE_ADDR] = { .len = sizeof(struct sockaddr_storage) }, [IWPM_NLA_RMANAGE_MAPPED_LOC_ADDR] = { .len = sizeof(struct sockaddr_storage) }, [IWPM_NLA_RMANAGE_MAPPING_ERR] = { .type = NLA_U16 } }; /** * iwpm_add_mapping_cb - Process the port mapper response to * iwpm_add_mapping request * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) */ int iwpm_add_mapping_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct iwpm_sa_data *pm_msg; struct iwpm_nlmsg_request *nlmsg_request = NULL; struct nlattr *nltb[IWPM_NLA_RMANAGE_MAPPING_MAX]; struct sockaddr_storage *local_sockaddr; struct sockaddr_storage *mapped_sockaddr; const char *msg_type; u32 msg_seq; msg_type = "Add Mapping response"; if (iwpm_parse_nlmsg(cb, IWPM_NLA_RMANAGE_MAPPING_MAX, resp_add_policy, nltb, msg_type)) return -EINVAL; atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); msg_seq = nla_get_u32(nltb[IWPM_NLA_RMANAGE_MAPPING_SEQ]); nlmsg_request = iwpm_find_nlmsg_request(msg_seq); if (!nlmsg_request) { pr_info("%s: Could not find a matching request (seq = %u)\n", __func__, msg_seq); return -EINVAL; } pm_msg = nlmsg_request->req_buffer; local_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RMANAGE_ADDR]); mapped_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RMANAGE_MAPPED_LOC_ADDR]); if (iwpm_compare_sockaddr(local_sockaddr, &pm_msg->loc_addr)) { nlmsg_request->err_code = IWPM_USER_LIB_INFO_ERR; goto add_mapping_response_exit; } if (mapped_sockaddr->ss_family != local_sockaddr->ss_family) { pr_info("%s: Sockaddr family doesn't match the requested one\n", __func__); nlmsg_request->err_code = IWPM_USER_LIB_INFO_ERR; goto add_mapping_response_exit; } memcpy(&pm_msg->mapped_loc_addr, mapped_sockaddr, sizeof(*mapped_sockaddr)); iwpm_print_sockaddr(&pm_msg->loc_addr, "add_mapping: Local sockaddr:"); iwpm_print_sockaddr(&pm_msg->mapped_loc_addr, "add_mapping: Mapped local sockaddr:"); add_mapping_response_exit: nlmsg_request->request_done = 1; /* always for found request */ kref_put(&nlmsg_request->kref, iwpm_free_nlmsg_request); barrier(); up(&nlmsg_request->sem); return 0; } /* netlink attribute policy for the response to add and query mapping request * and response with remote address info */ static const struct nla_policy resp_query_policy[IWPM_NLA_RQUERY_MAPPING_MAX] = { [IWPM_NLA_RQUERY_MAPPING_SEQ] = { .type = NLA_U32 }, [IWPM_NLA_RQUERY_LOCAL_ADDR] = { .len = sizeof(struct sockaddr_storage) }, [IWPM_NLA_RQUERY_REMOTE_ADDR] = { .len = sizeof(struct sockaddr_storage) }, [IWPM_NLA_RQUERY_MAPPED_LOC_ADDR] = { .len = sizeof(struct sockaddr_storage) }, [IWPM_NLA_RQUERY_MAPPED_REM_ADDR] = { .len = sizeof(struct sockaddr_storage) }, [IWPM_NLA_RQUERY_MAPPING_ERR] = { .type = NLA_U16 } }; /** * iwpm_add_and_query_mapping_cb - Process the port mapper response to * iwpm_add_and_query_mapping request * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) */ int iwpm_add_and_query_mapping_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct iwpm_sa_data *pm_msg; struct iwpm_nlmsg_request *nlmsg_request = NULL; struct nlattr *nltb[IWPM_NLA_RQUERY_MAPPING_MAX]; struct sockaddr_storage *local_sockaddr, *remote_sockaddr; struct sockaddr_storage *mapped_loc_sockaddr, *mapped_rem_sockaddr; const char *msg_type; u32 msg_seq; u16 err_code; msg_type = "Query Mapping response"; if (iwpm_parse_nlmsg(cb, IWPM_NLA_RQUERY_MAPPING_MAX, resp_query_policy, nltb, msg_type)) return -EINVAL; atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); msg_seq = nla_get_u32(nltb[IWPM_NLA_RQUERY_MAPPING_SEQ]); nlmsg_request = iwpm_find_nlmsg_request(msg_seq); if (!nlmsg_request) { pr_info("%s: Could not find a matching request (seq = %u)\n", __func__, msg_seq); return -EINVAL; } pm_msg = nlmsg_request->req_buffer; local_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_LOCAL_ADDR]); remote_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_REMOTE_ADDR]); mapped_loc_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_MAPPED_LOC_ADDR]); mapped_rem_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_MAPPED_REM_ADDR]); err_code = nla_get_u16(nltb[IWPM_NLA_RQUERY_MAPPING_ERR]); if (err_code == IWPM_REMOTE_QUERY_REJECT) { pr_info("%s: Received a Reject (pid = %u, echo seq = %u)\n", __func__, cb->nlh->nlmsg_pid, msg_seq); nlmsg_request->err_code = IWPM_REMOTE_QUERY_REJECT; } if (iwpm_compare_sockaddr(local_sockaddr, &pm_msg->loc_addr) || iwpm_compare_sockaddr(remote_sockaddr, &pm_msg->rem_addr)) { pr_info("%s: Incorrect local sockaddr\n", __func__); nlmsg_request->err_code = IWPM_USER_LIB_INFO_ERR; goto query_mapping_response_exit; } if (mapped_loc_sockaddr->ss_family != local_sockaddr->ss_family || mapped_rem_sockaddr->ss_family != remote_sockaddr->ss_family) { pr_info("%s: Sockaddr family doesn't match the requested one\n", __func__); nlmsg_request->err_code = IWPM_USER_LIB_INFO_ERR; goto query_mapping_response_exit; } memcpy(&pm_msg->mapped_loc_addr, mapped_loc_sockaddr, sizeof(*mapped_loc_sockaddr)); memcpy(&pm_msg->mapped_rem_addr, mapped_rem_sockaddr, sizeof(*mapped_rem_sockaddr)); iwpm_print_sockaddr(&pm_msg->loc_addr, "query_mapping: Local sockaddr:"); iwpm_print_sockaddr(&pm_msg->mapped_loc_addr, "query_mapping: Mapped local sockaddr:"); iwpm_print_sockaddr(&pm_msg->rem_addr, "query_mapping: Remote sockaddr:"); iwpm_print_sockaddr(&pm_msg->mapped_rem_addr, "query_mapping: Mapped remote sockaddr:"); query_mapping_response_exit: nlmsg_request->request_done = 1; /* always for found request */ kref_put(&nlmsg_request->kref, iwpm_free_nlmsg_request); barrier(); up(&nlmsg_request->sem); return 0; } /** * iwpm_remote_info_cb - Process remote connecting peer address info, which * the port mapper has received from the connecting peer * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) * * Stores the IPv4/IPv6 address info in a hash table */ int iwpm_remote_info_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct nlattr *nltb[IWPM_NLA_RQUERY_MAPPING_MAX]; struct sockaddr_storage *local_sockaddr, *remote_sockaddr; struct sockaddr_storage *mapped_loc_sockaddr, *mapped_rem_sockaddr; struct iwpm_remote_info *rem_info; const char *msg_type; u8 nl_client; int ret = -EINVAL; msg_type = "Remote Mapping info"; if (iwpm_parse_nlmsg(cb, IWPM_NLA_RQUERY_MAPPING_MAX, resp_query_policy, nltb, msg_type)) return ret; nl_client = RDMA_NL_GET_CLIENT(cb->nlh->nlmsg_type); atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); local_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_LOCAL_ADDR]); remote_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_REMOTE_ADDR]); mapped_loc_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_MAPPED_LOC_ADDR]); mapped_rem_sockaddr = (struct sockaddr_storage *) nla_data(nltb[IWPM_NLA_RQUERY_MAPPED_REM_ADDR]); if (mapped_loc_sockaddr->ss_family != local_sockaddr->ss_family || mapped_rem_sockaddr->ss_family != remote_sockaddr->ss_family) { pr_info("%s: Sockaddr family doesn't match the requested one\n", __func__); return ret; } rem_info = kzalloc(sizeof(struct iwpm_remote_info), GFP_ATOMIC); if (!rem_info) { ret = -ENOMEM; return ret; } memcpy(&rem_info->mapped_loc_sockaddr, mapped_loc_sockaddr, sizeof(struct sockaddr_storage)); memcpy(&rem_info->remote_sockaddr, remote_sockaddr, sizeof(struct sockaddr_storage)); memcpy(&rem_info->mapped_rem_sockaddr, mapped_rem_sockaddr, sizeof(struct sockaddr_storage)); rem_info->nl_client = nl_client; iwpm_add_remote_info(rem_info); iwpm_print_sockaddr(local_sockaddr, "remote_info: Local sockaddr:"); iwpm_print_sockaddr(mapped_loc_sockaddr, "remote_info: Mapped local sockaddr:"); iwpm_print_sockaddr(remote_sockaddr, "remote_info: Remote sockaddr:"); iwpm_print_sockaddr(mapped_rem_sockaddr, "remote_info: Mapped remote sockaddr:"); return ret; } /* netlink attribute policy for the received request for mapping info */ static const struct nla_policy resp_mapinfo_policy[IWPM_NLA_MAPINFO_REQ_MAX] = { [IWPM_NLA_MAPINFO_ULIB_NAME] = { .type = NLA_STRING, .len = IWPM_ULIBNAME_SIZE - 1 }, [IWPM_NLA_MAPINFO_ULIB_VER] = { .type = NLA_U16 } }; /** * iwpm_mapping_info_cb - Process a notification that the userspace * port mapper daemon is started * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) * * Using the received port mapper pid, send all the local mapping * info records to the userspace port mapper */ int iwpm_mapping_info_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct nlattr *nltb[IWPM_NLA_MAPINFO_REQ_MAX]; const char *msg_type = "Mapping Info response"; u8 nl_client; char *iwpm_name; u16 iwpm_version; int ret = -EINVAL; if (iwpm_parse_nlmsg(cb, IWPM_NLA_MAPINFO_REQ_MAX, resp_mapinfo_policy, nltb, msg_type)) { pr_info("%s: Unable to parse nlmsg\n", __func__); return ret; } iwpm_name = (char *)nla_data(nltb[IWPM_NLA_MAPINFO_ULIB_NAME]); iwpm_version = nla_get_u16(nltb[IWPM_NLA_MAPINFO_ULIB_VER]); if (strcmp(iwpm_ulib_name, iwpm_name) || iwpm_version < IWPM_UABI_VERSION_MIN) { pr_info("%s: Invalid port mapper name = %s version = %u\n", __func__, iwpm_name, iwpm_version); return ret; } nl_client = RDMA_NL_GET_CLIENT(cb->nlh->nlmsg_type); iwpm_set_registration(nl_client, IWPM_REG_INCOMPL); atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); iwpm_user_pid = cb->nlh->nlmsg_pid; if (iwpm_ulib_version < IWPM_UABI_VERSION) pr_warn_once("%s: Down level iwpmd/pid %d. Continuing...", __func__, iwpm_user_pid); if (!iwpm_mapinfo_available()) return 0; pr_debug("%s: iWarp Port Mapper (pid = %d) is available!\n", __func__, iwpm_user_pid); ret = iwpm_send_mapinfo(nl_client, iwpm_user_pid); return ret; } /* netlink attribute policy for the received mapping info ack */ static const struct nla_policy ack_mapinfo_policy[IWPM_NLA_MAPINFO_NUM_MAX] = { [IWPM_NLA_MAPINFO_SEQ] = { .type = NLA_U32 }, [IWPM_NLA_MAPINFO_SEND_NUM] = { .type = NLA_U32 }, [IWPM_NLA_MAPINFO_ACK_NUM] = { .type = NLA_U32 } }; /** * iwpm_ack_mapping_info_cb - Process the port mapper ack for * the provided local mapping info records * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) */ int iwpm_ack_mapping_info_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct nlattr *nltb[IWPM_NLA_MAPINFO_NUM_MAX]; u32 mapinfo_send, mapinfo_ack; const char *msg_type = "Mapping Info Ack"; if (iwpm_parse_nlmsg(cb, IWPM_NLA_MAPINFO_NUM_MAX, ack_mapinfo_policy, nltb, msg_type)) return -EINVAL; mapinfo_send = nla_get_u32(nltb[IWPM_NLA_MAPINFO_SEND_NUM]); mapinfo_ack = nla_get_u32(nltb[IWPM_NLA_MAPINFO_ACK_NUM]); if (mapinfo_ack != mapinfo_send) pr_info("%s: Invalid mapinfo number (sent = %u ack-ed = %u)\n", __func__, mapinfo_send, mapinfo_ack); atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); return 0; } /* netlink attribute policy for the received port mapper error message */ static const struct nla_policy map_error_policy[IWPM_NLA_ERR_MAX] = { [IWPM_NLA_ERR_SEQ] = { .type = NLA_U32 }, [IWPM_NLA_ERR_CODE] = { .type = NLA_U16 }, }; /** * iwpm_mapping_error_cb - Process port mapper notification for error * * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) */ int iwpm_mapping_error_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct iwpm_nlmsg_request *nlmsg_request = NULL; int nl_client = RDMA_NL_GET_CLIENT(cb->nlh->nlmsg_type); struct nlattr *nltb[IWPM_NLA_ERR_MAX]; u32 msg_seq; u16 err_code; const char *msg_type = "Mapping Error Msg"; if (iwpm_parse_nlmsg(cb, IWPM_NLA_ERR_MAX, map_error_policy, nltb, msg_type)) return -EINVAL; msg_seq = nla_get_u32(nltb[IWPM_NLA_ERR_SEQ]); err_code = nla_get_u16(nltb[IWPM_NLA_ERR_CODE]); pr_info("%s: Received msg seq = %u err code = %u client = %d\n", __func__, msg_seq, err_code, nl_client); /* look for nlmsg_request */ nlmsg_request = iwpm_find_nlmsg_request(msg_seq); if (!nlmsg_request) { /* not all errors have associated requests */ pr_debug("Could not find matching req (seq = %u)\n", msg_seq); return 0; } atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); nlmsg_request->err_code = err_code; nlmsg_request->request_done = 1; /* always for found request */ kref_put(&nlmsg_request->kref, iwpm_free_nlmsg_request); barrier(); up(&nlmsg_request->sem); return 0; } /* netlink attribute policy for the received hello request */ static const struct nla_policy hello_policy[IWPM_NLA_HELLO_MAX] = { [IWPM_NLA_HELLO_ABI_VERSION] = { .type = NLA_U16 } }; /** * iwpm_hello_cb - Process a hello message from iwpmd * * @skb: The socket buffer * @cb: Contains the received message (payload and netlink header) * * Using the received port mapper pid, send the kernel's abi_version * after adjusting it to support the iwpmd version. */ int iwpm_hello_cb(struct sk_buff *skb, struct netlink_callback *cb) { struct nlattr *nltb[IWPM_NLA_HELLO_MAX]; const char *msg_type = "Hello request"; u8 nl_client; u16 abi_version; int ret = -EINVAL; if (iwpm_parse_nlmsg(cb, IWPM_NLA_HELLO_MAX, hello_policy, nltb, msg_type)) { pr_info("%s: Unable to parse nlmsg\n", __func__); return ret; } abi_version = nla_get_u16(nltb[IWPM_NLA_HELLO_ABI_VERSION]); nl_client = RDMA_NL_GET_CLIENT(cb->nlh->nlmsg_type); iwpm_set_registration(nl_client, IWPM_REG_INCOMPL); atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq); iwpm_ulib_version = min_t(u16, IWPM_UABI_VERSION, abi_version); pr_debug("Using ABI version %u\n", iwpm_ulib_version); iwpm_user_pid = cb->nlh->nlmsg_pid; ret = iwpm_send_hello(nl_client, iwpm_user_pid, iwpm_ulib_version); return ret; } |
| 190 12 18 5716 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_PARAVIRT_H #define _ASM_X86_PARAVIRT_H /* Various instructions on x86 need to be replaced for * para-virtualization: those hooks are defined here. */ #include <asm/paravirt_types.h> #ifndef __ASSEMBLER__ struct mm_struct; #endif #ifdef CONFIG_PARAVIRT #include <asm/pgtable_types.h> #include <asm/asm.h> #include <asm/nospec-branch.h> #ifndef __ASSEMBLER__ #include <linux/bug.h> #include <linux/types.h> #include <linux/cpumask.h> #include <linux/static_call_types.h> #include <asm/frame.h> u64 dummy_steal_clock(int cpu); u64 dummy_sched_clock(void); DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock); DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock); void paravirt_set_sched_clock(u64 (*func)(void)); static __always_inline u64 paravirt_sched_clock(void) { return static_call(pv_sched_clock)(); } struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; __visible void __native_queued_spin_unlock(struct qspinlock *lock); bool pv_is_native_spin_unlock(void); __visible bool __native_vcpu_is_preempted(long cpu); bool pv_is_native_vcpu_is_preempted(void); static inline u64 paravirt_steal_clock(int cpu) { return static_call(pv_steal_clock)(cpu); } #ifdef CONFIG_PARAVIRT_SPINLOCKS void __init paravirt_set_cap(void); #endif /* The paravirtualized I/O functions */ static inline void slow_down_io(void) { PVOP_VCALL0(cpu.io_delay); #ifdef REALLY_SLOW_IO PVOP_VCALL0(cpu.io_delay); PVOP_VCALL0(cpu.io_delay); PVOP_VCALL0(cpu.io_delay); #endif } void native_flush_tlb_local(void); void native_flush_tlb_global(void); void native_flush_tlb_one_user(unsigned long addr); void native_flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info); static inline void __flush_tlb_local(void) { PVOP_VCALL0(mmu.flush_tlb_user); } static inline void __flush_tlb_global(void) { PVOP_VCALL0(mmu.flush_tlb_kernel); } static inline void __flush_tlb_one_user(unsigned long addr) { PVOP_VCALL1(mmu.flush_tlb_one_user, addr); } static inline void __flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info) { PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info); } static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) { PVOP_VCALL1(mmu.exit_mmap, mm); } static inline void notify_page_enc_status_changed(unsigned long pfn, int npages, bool enc) { PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc); } static __always_inline void arch_safe_halt(void) { PVOP_VCALL0(irq.safe_halt); } static inline void halt(void) { PVOP_VCALL0(irq.halt); } #ifdef CONFIG_PARAVIRT_XXL static inline void load_sp0(unsigned long sp0) { PVOP_VCALL1(cpu.load_sp0, sp0); } /* The paravirtualized CPUID instruction. */ static inline void __cpuid(unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx) { PVOP_VCALL4(cpu.cpuid, eax, ebx, ecx, edx); } /* * These special macros can be used to get or set a debugging register */ static __always_inline unsigned long paravirt_get_debugreg(int reg) { return PVOP_CALL1(unsigned long, cpu.get_debugreg, reg); } #define get_debugreg(var, reg) var = paravirt_get_debugreg(reg) static __always_inline void set_debugreg(unsigned long val, int reg) { PVOP_VCALL2(cpu.set_debugreg, reg, val); } static inline unsigned long read_cr0(void) { return PVOP_CALL0(unsigned long, cpu.read_cr0); } static inline void write_cr0(unsigned long x) { PVOP_VCALL1(cpu.write_cr0, x); } static __always_inline unsigned long read_cr2(void) { return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2, "mov %%cr2, %%rax;", ALT_NOT_XEN); } static __always_inline void write_cr2(unsigned long x) { PVOP_VCALL1(mmu.write_cr2, x); } static inline unsigned long __read_cr3(void) { return PVOP_ALT_CALL0(unsigned long, mmu.read_cr3, "mov %%cr3, %%rax;", ALT_NOT_XEN); } static inline void write_cr3(unsigned long x) { PVOP_ALT_VCALL1(mmu.write_cr3, x, "mov %%rdi, %%cr3", ALT_NOT_XEN); } static inline void __write_cr4(unsigned long x) { PVOP_VCALL1(cpu.write_cr4, x); } static inline u64 paravirt_read_msr(u32 msr) { return PVOP_CALL1(u64, cpu.read_msr, msr); } static inline void paravirt_write_msr(u32 msr, u64 val) { PVOP_VCALL2(cpu.write_msr, msr, val); } static inline int paravirt_read_msr_safe(u32 msr, u64 *val) { return PVOP_CALL2(int, cpu.read_msr_safe, msr, val); } static inline int paravirt_write_msr_safe(u32 msr, u64 val) { return PVOP_CALL2(int, cpu.write_msr_safe, msr, val); } #define rdmsr(msr, val1, val2) \ do { \ u64 _l = paravirt_read_msr(msr); \ val1 = (u32)_l; \ val2 = _l >> 32; \ } while (0) static __always_inline void wrmsr(u32 msr, u32 low, u32 high) { paravirt_write_msr(msr, (u64)high << 32 | low); } #define rdmsrq(msr, val) \ do { \ val = paravirt_read_msr(msr); \ } while (0) static inline void wrmsrq(u32 msr, u64 val) { paravirt_write_msr(msr, val); } static inline int wrmsrq_safe(u32 msr, u64 val) { return paravirt_write_msr_safe(msr, val); } /* rdmsr with exception handling */ #define rdmsr_safe(msr, a, b) \ ({ \ u64 _l; \ int _err = paravirt_read_msr_safe((msr), &_l); \ (*a) = (u32)_l; \ (*b) = (u32)(_l >> 32); \ _err; \ }) static __always_inline int rdmsrq_safe(u32 msr, u64 *p) { return paravirt_read_msr_safe(msr, p); } static __always_inline u64 rdpmc(int counter) { return PVOP_CALL1(u64, cpu.read_pmc, counter); } static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries) { PVOP_VCALL2(cpu.alloc_ldt, ldt, entries); } static inline void paravirt_free_ldt(struct desc_struct *ldt, unsigned entries) { PVOP_VCALL2(cpu.free_ldt, ldt, entries); } static inline void load_TR_desc(void) { PVOP_VCALL0(cpu.load_tr_desc); } static inline void load_gdt(const struct desc_ptr *dtr) { PVOP_VCALL1(cpu.load_gdt, dtr); } static inline void load_idt(const struct desc_ptr *dtr) { PVOP_VCALL1(cpu.load_idt, dtr); } static inline void set_ldt(const void *addr, unsigned entries) { PVOP_VCALL2(cpu.set_ldt, addr, entries); } static inline unsigned long paravirt_store_tr(void) { return PVOP_CALL0(unsigned long, cpu.store_tr); } #define store_tr(tr) ((tr) = paravirt_store_tr()) static inline void load_TLS(struct thread_struct *t, unsigned cpu) { PVOP_VCALL2(cpu.load_tls, t, cpu); } static inline void load_gs_index(unsigned int gs) { PVOP_VCALL1(cpu.load_gs_index, gs); } static inline void write_ldt_entry(struct desc_struct *dt, int entry, const void *desc) { PVOP_VCALL3(cpu.write_ldt_entry, dt, entry, desc); } static inline void write_gdt_entry(struct desc_struct *dt, int entry, void *desc, int type) { PVOP_VCALL4(cpu.write_gdt_entry, dt, entry, desc, type); } static inline void write_idt_entry(gate_desc *dt, int entry, const gate_desc *g) { PVOP_VCALL3(cpu.write_idt_entry, dt, entry, g); } #ifdef CONFIG_X86_IOPL_IOPERM static inline void tss_invalidate_io_bitmap(void) { PVOP_VCALL0(cpu.invalidate_io_bitmap); } static inline void tss_update_io_bitmap(void) { PVOP_VCALL0(cpu.update_io_bitmap); } #endif static inline void paravirt_enter_mmap(struct mm_struct *next) { PVOP_VCALL1(mmu.enter_mmap, next); } static inline int paravirt_pgd_alloc(struct mm_struct *mm) { return PVOP_CALL1(int, mmu.pgd_alloc, mm); } static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *pgd) { PVOP_VCALL2(mmu.pgd_free, mm, pgd); } static inline void paravirt_alloc_pte(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_pte, mm, pfn); } static inline void paravirt_release_pte(unsigned long pfn) { PVOP_VCALL1(mmu.release_pte, pfn); } static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_pmd, mm, pfn); } static inline void paravirt_release_pmd(unsigned long pfn) { PVOP_VCALL1(mmu.release_pmd, pfn); } static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_pud, mm, pfn); } static inline void paravirt_release_pud(unsigned long pfn) { PVOP_VCALL1(mmu.release_pud, pfn); } static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_p4d, mm, pfn); } static inline void paravirt_release_p4d(unsigned long pfn) { PVOP_VCALL1(mmu.release_p4d, pfn); } static inline pte_t __pte(pteval_t val) { return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pteval_t pte_val(pte_t pte) { return PVOP_ALT_CALLEE1(pteval_t, mmu.pte_val, pte.pte, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline pgd_t __pgd(pgdval_t val) { return (pgd_t) { PVOP_ALT_CALLEE1(pgdval_t, mmu.make_pgd, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pgdval_t pgd_val(pgd_t pgd) { return PVOP_ALT_CALLEE1(pgdval_t, mmu.pgd_val, pgd.pgd, "mov %%rdi, %%rax", ALT_NOT_XEN); } #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { pteval_t ret; ret = PVOP_CALL3(pteval_t, mmu.ptep_modify_prot_start, vma, addr, ptep); return (pte_t) { .pte = ret }; } static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t pte) { PVOP_VCALL4(mmu.ptep_modify_prot_commit, vma, addr, ptep, pte.pte); } static inline void set_pte(pte_t *ptep, pte_t pte) { PVOP_VCALL2(mmu.set_pte, ptep, pte.pte); } static inline void set_pmd(pmd_t *pmdp, pmd_t pmd) { PVOP_VCALL2(mmu.set_pmd, pmdp, native_pmd_val(pmd)); } static inline pmd_t __pmd(pmdval_t val) { return (pmd_t) { PVOP_ALT_CALLEE1(pmdval_t, mmu.make_pmd, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pmdval_t pmd_val(pmd_t pmd) { return PVOP_ALT_CALLEE1(pmdval_t, mmu.pmd_val, pmd.pmd, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline void set_pud(pud_t *pudp, pud_t pud) { PVOP_VCALL2(mmu.set_pud, pudp, native_pud_val(pud)); } static inline pud_t __pud(pudval_t val) { pudval_t ret; ret = PVOP_ALT_CALLEE1(pudval_t, mmu.make_pud, val, "mov %%rdi, %%rax", ALT_NOT_XEN); return (pud_t) { ret }; } static inline pudval_t pud_val(pud_t pud) { return PVOP_ALT_CALLEE1(pudval_t, mmu.pud_val, pud.pud, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline void pud_clear(pud_t *pudp) { set_pud(pudp, native_make_pud(0)); } static inline void set_p4d(p4d_t *p4dp, p4d_t p4d) { p4dval_t val = native_p4d_val(p4d); PVOP_VCALL2(mmu.set_p4d, p4dp, val); } static inline p4d_t __p4d(p4dval_t val) { p4dval_t ret = PVOP_ALT_CALLEE1(p4dval_t, mmu.make_p4d, val, "mov %%rdi, %%rax", ALT_NOT_XEN); return (p4d_t) { ret }; } static inline p4dval_t p4d_val(p4d_t p4d) { return PVOP_ALT_CALLEE1(p4dval_t, mmu.p4d_val, p4d.p4d, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline void __set_pgd(pgd_t *pgdp, pgd_t pgd) { PVOP_VCALL2(mmu.set_pgd, pgdp, native_pgd_val(pgd)); } #define set_pgd(pgdp, pgdval) do { \ if (pgtable_l5_enabled()) \ __set_pgd(pgdp, pgdval); \ else \ set_p4d((p4d_t *)(pgdp), (p4d_t) { (pgdval).pgd }); \ } while (0) #define pgd_clear(pgdp) do { \ if (pgtable_l5_enabled()) \ set_pgd(pgdp, native_make_pgd(0)); \ } while (0) static inline void p4d_clear(p4d_t *p4dp) { set_p4d(p4dp, native_make_p4d(0)); } static inline void set_pte_atomic(pte_t *ptep, pte_t pte) { set_pte(ptep, pte); } static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { set_pte(ptep, native_make_pte(0)); } static inline void pmd_clear(pmd_t *pmdp) { set_pmd(pmdp, native_make_pmd(0)); } #define __HAVE_ARCH_START_CONTEXT_SWITCH static inline void arch_start_context_switch(struct task_struct *prev) { PVOP_VCALL1(cpu.start_context_switch, prev); } static inline void arch_end_context_switch(struct task_struct *next) { PVOP_VCALL1(cpu.end_context_switch, next); } #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE static inline void arch_enter_lazy_mmu_mode(void) { PVOP_VCALL0(mmu.lazy_mode.enter); } static inline void arch_leave_lazy_mmu_mode(void) { PVOP_VCALL0(mmu.lazy_mode.leave); } static inline void arch_flush_lazy_mmu_mode(void) { PVOP_VCALL0(mmu.lazy_mode.flush); } static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, phys_addr_t phys, pgprot_t flags) { pv_ops.mmu.set_fixmap(idx, phys, flags); } #endif #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS) static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) { PVOP_VCALL2(lock.queued_spin_lock_slowpath, lock, val); } static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock) { PVOP_ALT_VCALLEE1(lock.queued_spin_unlock, lock, "movb $0, (%%" _ASM_ARG1 ");", ALT_NOT(X86_FEATURE_PVUNLOCK)); } static __always_inline void pv_wait(u8 *ptr, u8 val) { PVOP_VCALL2(lock.wait, ptr, val); } static __always_inline void pv_kick(int cpu) { PVOP_VCALL1(lock.kick, cpu); } static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_ALT_CALLEE1(bool, lock.vcpu_is_preempted, cpu, "xor %%" _ASM_AX ", %%" _ASM_AX ";", ALT_NOT(X86_FEATURE_VCPUPREEMPT)); } void __raw_callee_save___native_queued_spin_unlock(struct qspinlock *lock); bool __raw_callee_save___native_vcpu_is_preempted(long cpu); #endif /* SMP && PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 /* save and restore all caller-save registers, except return value */ #define PV_SAVE_ALL_CALLER_REGS "pushl %ecx;" #define PV_RESTORE_ALL_CALLER_REGS "popl %ecx;" #else /* save and restore all caller-save registers, except return value */ #define PV_SAVE_ALL_CALLER_REGS \ "push %rcx;" \ "push %rdx;" \ "push %rsi;" \ "push %rdi;" \ "push %r8;" \ "push %r9;" \ "push %r10;" \ "push %r11;" #define PV_RESTORE_ALL_CALLER_REGS \ "pop %r11;" \ "pop %r10;" \ "pop %r9;" \ "pop %r8;" \ "pop %rdi;" \ "pop %rsi;" \ "pop %rdx;" \ "pop %rcx;" #endif /* * Generate a thunk around a function which saves all caller-save * registers except for the return value. This allows C functions to * be called from assembler code where fewer than normal registers are * available. It may also help code generation around calls from C * code if the common case doesn't use many registers. * * When a callee is wrapped in a thunk, the caller can assume that all * arg regs and all scratch registers are preserved across the * call. The return value in rax/eax will not be saved, even for void * functions. */ #define PV_THUNK_NAME(func) "__raw_callee_save_" #func #define __PV_CALLEE_SAVE_REGS_THUNK(func, section) \ extern typeof(func) __raw_callee_save_##func; \ \ asm(".pushsection " section ", \"ax\";" \ ".globl " PV_THUNK_NAME(func) ";" \ ".type " PV_THUNK_NAME(func) ", @function;" \ ASM_FUNC_ALIGN \ PV_THUNK_NAME(func) ":" \ ASM_ENDBR \ FRAME_BEGIN \ PV_SAVE_ALL_CALLER_REGS \ "call " #func ";" \ PV_RESTORE_ALL_CALLER_REGS \ FRAME_END \ ASM_RET \ ".size " PV_THUNK_NAME(func) ", .-" PV_THUNK_NAME(func) ";" \ ".popsection") #define PV_CALLEE_SAVE_REGS_THUNK(func) \ __PV_CALLEE_SAVE_REGS_THUNK(func, ".text") /* Get a reference to a callee-save function */ #define PV_CALLEE_SAVE(func) \ ((struct paravirt_callee_save) { __raw_callee_save_##func }) /* Promise that "func" already uses the right calling convention */ #define __PV_IS_CALLEE_SAVE(func) \ ((struct paravirt_callee_save) { func }) #ifdef CONFIG_PARAVIRT_XXL static __always_inline unsigned long arch_local_save_flags(void) { return PVOP_ALT_CALLEE0(unsigned long, irq.save_fl, "pushf; pop %%rax;", ALT_NOT_XEN); } static __always_inline void arch_local_irq_disable(void) { PVOP_ALT_VCALLEE0(irq.irq_disable, "cli;", ALT_NOT_XEN); } static __always_inline void arch_local_irq_enable(void) { PVOP_ALT_VCALLEE0(irq.irq_enable, "sti;", ALT_NOT_XEN); } static __always_inline unsigned long arch_local_irq_save(void) { unsigned long f; f = arch_local_save_flags(); arch_local_irq_disable(); return f; } #endif /* Make sure as little as possible of this mess escapes. */ #undef PARAVIRT_CALL #undef __PVOP_CALL #undef __PVOP_VCALL #undef PVOP_VCALL0 #undef PVOP_CALL0 #undef PVOP_VCALL1 #undef PVOP_CALL1 #undef PVOP_VCALL2 #undef PVOP_CALL2 #undef PVOP_VCALL3 #undef PVOP_CALL3 #undef PVOP_VCALL4 #undef PVOP_CALL4 extern void default_banner(void); void native_pv_lock_init(void) __init; #else /* __ASSEMBLER__ */ #ifdef CONFIG_X86_64 #ifdef CONFIG_PARAVIRT_XXL #ifdef CONFIG_DEBUG_ENTRY #define PARA_INDIRECT(addr) *addr(%rip) .macro PARA_IRQ_save_fl ANNOTATE_RETPOLINE_SAFE; call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl); .endm #define SAVE_FLAGS ALTERNATIVE_2 "PARA_IRQ_save_fl;", \ "ALT_CALL_INSTR;", ALT_CALL_ALWAYS, \ "pushf; pop %rax;", ALT_NOT_XEN #endif #endif /* CONFIG_PARAVIRT_XXL */ #endif /* CONFIG_X86_64 */ #endif /* __ASSEMBLER__ */ #else /* CONFIG_PARAVIRT */ # define default_banner x86_init_noop #ifndef __ASSEMBLER__ static inline void native_pv_lock_init(void) { } #endif #endif /* !CONFIG_PARAVIRT */ #ifndef __ASSEMBLER__ #ifndef CONFIG_PARAVIRT_XXL static inline void paravirt_enter_mmap(struct mm_struct *mm) { } #endif #ifndef CONFIG_PARAVIRT static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) { } #endif #ifndef CONFIG_PARAVIRT_SPINLOCKS static inline void paravirt_set_cap(void) { } #endif #endif /* __ASSEMBLER__ */ #endif /* _ASM_X86_PARAVIRT_H */ |
| 56 56 56 56 52 52 1 52 52 1 52 52 56 52 52 52 52 1 51 52 52 56 56 52 52 52 51 51 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 | // SPDX-License-Identifier: GPL-2.0-or-later /* * The Serio abstraction module * * Copyright (c) 1999-2004 Vojtech Pavlik * Copyright (c) 2004 Dmitry Torokhov * Copyright (c) 2003 Daniele Bellucci */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/stddef.h> #include <linux/module.h> #include <linux/serio.h> #include <linux/errno.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/workqueue.h> #include <linux/mutex.h> MODULE_AUTHOR("Vojtech Pavlik <vojtech@ucw.cz>"); MODULE_DESCRIPTION("Serio abstraction core"); MODULE_LICENSE("GPL"); /* * serio_mutex protects entire serio subsystem and is taken every time * serio port or driver registered or unregistered. */ static DEFINE_MUTEX(serio_mutex); static LIST_HEAD(serio_list); static void serio_add_port(struct serio *serio); static int serio_reconnect_port(struct serio *serio); static void serio_disconnect_port(struct serio *serio); static void serio_reconnect_subtree(struct serio *serio); static void serio_attach_driver(struct serio_driver *drv); static int serio_connect_driver(struct serio *serio, struct serio_driver *drv) { guard(mutex)(&serio->drv_mutex); return drv->connect(serio, drv); } static int serio_reconnect_driver(struct serio *serio) { guard(mutex)(&serio->drv_mutex); if (serio->drv && serio->drv->reconnect) return serio->drv->reconnect(serio); return -1; } static void serio_disconnect_driver(struct serio *serio) { guard(mutex)(&serio->drv_mutex); if (serio->drv) serio->drv->disconnect(serio); } static int serio_match_port(const struct serio_device_id *ids, struct serio *serio) { while (ids->type || ids->proto) { if ((ids->type == SERIO_ANY || ids->type == serio->id.type) && (ids->proto == SERIO_ANY || ids->proto == serio->id.proto) && (ids->extra == SERIO_ANY || ids->extra == serio->id.extra) && (ids->id == SERIO_ANY || ids->id == serio->id.id)) return 1; ids++; } return 0; } /* * Basic serio -> driver core mappings */ static int serio_bind_driver(struct serio *serio, struct serio_driver *drv) { int error; if (serio_match_port(drv->id_table, serio)) { serio->dev.driver = &drv->driver; if (serio_connect_driver(serio, drv)) { serio->dev.driver = NULL; return -ENODEV; } error = device_bind_driver(&serio->dev); if (error) { dev_warn(&serio->dev, "device_bind_driver() failed for %s (%s) and %s, error: %d\n", serio->phys, serio->name, drv->description, error); serio_disconnect_driver(serio); serio->dev.driver = NULL; return error; } } return 0; } static void serio_find_driver(struct serio *serio) { int error; error = device_attach(&serio->dev); if (error < 0 && error != -EPROBE_DEFER) dev_warn(&serio->dev, "device_attach() failed for %s (%s), error: %d\n", serio->phys, serio->name, error); } /* * Serio event processing. */ enum serio_event_type { SERIO_RESCAN_PORT, SERIO_RECONNECT_PORT, SERIO_RECONNECT_SUBTREE, SERIO_REGISTER_PORT, SERIO_ATTACH_DRIVER, }; struct serio_event { enum serio_event_type type; void *object; struct module *owner; struct list_head node; }; static DEFINE_SPINLOCK(serio_event_lock); /* protects serio_event_list */ static LIST_HEAD(serio_event_list); static struct serio_event *serio_get_event(void) { struct serio_event *event = NULL; guard(spinlock_irqsave)(&serio_event_lock); if (!list_empty(&serio_event_list)) { event = list_first_entry(&serio_event_list, struct serio_event, node); list_del_init(&event->node); } return event; } static void serio_free_event(struct serio_event *event) { module_put(event->owner); kfree(event); } static void serio_remove_duplicate_events(void *object, enum serio_event_type type) { struct serio_event *e, *next; guard(spinlock_irqsave)(&serio_event_lock); list_for_each_entry_safe(e, next, &serio_event_list, node) { if (object == e->object) { /* * If this event is of different type we should not * look further - we only suppress duplicate events * that were sent back-to-back. */ if (type != e->type) break; list_del_init(&e->node); serio_free_event(e); } } } static void serio_handle_event(struct work_struct *work) { struct serio_event *event; guard(mutex)(&serio_mutex); while ((event = serio_get_event())) { switch (event->type) { case SERIO_REGISTER_PORT: serio_add_port(event->object); break; case SERIO_RECONNECT_PORT: serio_reconnect_port(event->object); break; case SERIO_RESCAN_PORT: serio_disconnect_port(event->object); serio_find_driver(event->object); break; case SERIO_RECONNECT_SUBTREE: serio_reconnect_subtree(event->object); break; case SERIO_ATTACH_DRIVER: serio_attach_driver(event->object); break; } serio_remove_duplicate_events(event->object, event->type); serio_free_event(event); } } static DECLARE_WORK(serio_event_work, serio_handle_event); static int serio_queue_event(void *object, struct module *owner, enum serio_event_type event_type) { struct serio_event *event; guard(spinlock_irqsave)(&serio_event_lock); /* * Scan event list for the other events for the same serio port, * starting with the most recent one. If event is the same we * do not need add new one. If event is of different type we * need to add this event and should not look further because * we need to preseve sequence of distinct events. */ list_for_each_entry_reverse(event, &serio_event_list, node) { if (event->object == object) { if (event->type == event_type) return 0; break; } } event = kmalloc(sizeof(*event), GFP_ATOMIC); if (!event) { pr_err("Not enough memory to queue event %d\n", event_type); return -ENOMEM; } if (!try_module_get(owner)) { pr_warn("Can't get module reference, dropping event %d\n", event_type); kfree(event); return -EINVAL; } event->type = event_type; event->object = object; event->owner = owner; list_add_tail(&event->node, &serio_event_list); queue_work(system_long_wq, &serio_event_work); return 0; } /* * Remove all events that have been submitted for a given * object, be it serio port or driver. */ static void serio_remove_pending_events(void *object) { struct serio_event *event, *next; guard(spinlock_irqsave)(&serio_event_lock); list_for_each_entry_safe(event, next, &serio_event_list, node) { if (event->object == object) { list_del_init(&event->node); serio_free_event(event); } } } /* * Locate child serio port (if any) that has not been fully registered yet. * * Children are registered by driver's connect() handler so there can't be a * grandchild pending registration together with a child. */ static struct serio *serio_get_pending_child(struct serio *parent) { struct serio_event *event; struct serio *serio; guard(spinlock_irqsave)(&serio_event_lock); list_for_each_entry(event, &serio_event_list, node) { if (event->type == SERIO_REGISTER_PORT) { serio = event->object; if (serio->parent == parent) return serio; } } return NULL; } /* * Serio port operations */ static ssize_t serio_show_description(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "%s\n", serio->name); } static ssize_t modalias_show(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "serio:ty%02Xpr%02Xid%02Xex%02X\n", serio->id.type, serio->id.proto, serio->id.id, serio->id.extra); } static ssize_t type_show(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "%02x\n", serio->id.type); } static ssize_t proto_show(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "%02x\n", serio->id.proto); } static ssize_t id_show(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "%02x\n", serio->id.id); } static ssize_t extra_show(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "%02x\n", serio->id.extra); } static ssize_t drvctl_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct serio *serio = to_serio_port(dev); struct device_driver *drv; int error; scoped_cond_guard(mutex_intr, return -EINTR, &serio_mutex) { if (!strncmp(buf, "none", count)) { serio_disconnect_port(serio); } else if (!strncmp(buf, "reconnect", count)) { serio_reconnect_subtree(serio); } else if (!strncmp(buf, "rescan", count)) { serio_disconnect_port(serio); serio_find_driver(serio); serio_remove_duplicate_events(serio, SERIO_RESCAN_PORT); } else if ((drv = driver_find(buf, &serio_bus)) != NULL) { serio_disconnect_port(serio); error = serio_bind_driver(serio, to_serio_driver(drv)); serio_remove_duplicate_events(serio, SERIO_RESCAN_PORT); if (error) return error; } else { return -EINVAL; } } return count; } static ssize_t serio_show_bind_mode(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "%s\n", serio->manual_bind ? "manual" : "auto"); } static ssize_t serio_set_bind_mode(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct serio *serio = to_serio_port(dev); int retval; retval = count; if (!strncmp(buf, "manual", count)) { serio->manual_bind = true; } else if (!strncmp(buf, "auto", count)) { serio->manual_bind = false; } else { retval = -EINVAL; } return retval; } static ssize_t firmware_id_show(struct device *dev, struct device_attribute *attr, char *buf) { struct serio *serio = to_serio_port(dev); return sprintf(buf, "%s\n", serio->firmware_id); } static DEVICE_ATTR_RO(type); static DEVICE_ATTR_RO(proto); static DEVICE_ATTR_RO(id); static DEVICE_ATTR_RO(extra); static struct attribute *serio_device_id_attrs[] = { &dev_attr_type.attr, &dev_attr_proto.attr, &dev_attr_id.attr, &dev_attr_extra.attr, NULL }; static const struct attribute_group serio_id_attr_group = { .name = "id", .attrs = serio_device_id_attrs, }; static DEVICE_ATTR_RO(modalias); static DEVICE_ATTR_WO(drvctl); static DEVICE_ATTR(description, S_IRUGO, serio_show_description, NULL); static DEVICE_ATTR(bind_mode, S_IWUSR | S_IRUGO, serio_show_bind_mode, serio_set_bind_mode); static DEVICE_ATTR_RO(firmware_id); static struct attribute *serio_device_attrs[] = { &dev_attr_modalias.attr, &dev_attr_description.attr, &dev_attr_drvctl.attr, &dev_attr_bind_mode.attr, &dev_attr_firmware_id.attr, NULL }; static const struct attribute_group serio_device_attr_group = { .attrs = serio_device_attrs, }; static const struct attribute_group *serio_device_attr_groups[] = { &serio_id_attr_group, &serio_device_attr_group, NULL }; static void serio_release_port(struct device *dev) { struct serio *serio = to_serio_port(dev); kfree(serio); module_put(THIS_MODULE); } /* * Prepare serio port for registration. */ static void serio_init_port(struct serio *serio) { static atomic_t serio_no = ATOMIC_INIT(-1); __module_get(THIS_MODULE); INIT_LIST_HEAD(&serio->node); INIT_LIST_HEAD(&serio->child_node); INIT_LIST_HEAD(&serio->children); spin_lock_init(&serio->lock); mutex_init(&serio->drv_mutex); device_initialize(&serio->dev); dev_set_name(&serio->dev, "serio%lu", (unsigned long)atomic_inc_return(&serio_no)); serio->dev.bus = &serio_bus; serio->dev.release = serio_release_port; serio->dev.groups = serio_device_attr_groups; if (serio->parent) { serio->dev.parent = &serio->parent->dev; serio->depth = serio->parent->depth + 1; } else serio->depth = 0; lockdep_set_subclass(&serio->lock, serio->depth); } /* * Complete serio port registration. * Driver core will attempt to find appropriate driver for the port. */ static void serio_add_port(struct serio *serio) { struct serio *parent = serio->parent; int error; if (parent) { guard(serio_pause_rx)(parent); list_add_tail(&serio->child_node, &parent->children); } list_add_tail(&serio->node, &serio_list); if (serio->start) serio->start(serio); error = device_add(&serio->dev); if (error) dev_err(&serio->dev, "device_add() failed for %s (%s), error: %d\n", serio->phys, serio->name, error); } /* * serio_destroy_port() completes unregistration process and removes * port from the system */ static void serio_destroy_port(struct serio *serio) { struct serio *child; while ((child = serio_get_pending_child(serio)) != NULL) { serio_remove_pending_events(child); put_device(&child->dev); } if (serio->stop) serio->stop(serio); if (serio->parent) { guard(serio_pause_rx)(serio->parent); list_del_init(&serio->child_node); serio->parent = NULL; } if (device_is_registered(&serio->dev)) device_del(&serio->dev); list_del_init(&serio->node); serio_remove_pending_events(serio); put_device(&serio->dev); } /* * Reconnect serio port (re-initialize attached device). * If reconnect fails (old device is no longer attached or * there was no device to begin with) we do full rescan in * hope of finding a driver for the port. */ static int serio_reconnect_port(struct serio *serio) { int error = serio_reconnect_driver(serio); if (error) { serio_disconnect_port(serio); serio_find_driver(serio); } return error; } /* * Reconnect serio port and all its children (re-initialize attached * devices). */ static void serio_reconnect_subtree(struct serio *root) { struct serio *s = root; int error; do { error = serio_reconnect_port(s); if (!error) { /* * Reconnect was successful, move on to do the * first child. */ if (!list_empty(&s->children)) { s = list_first_entry(&s->children, struct serio, child_node); continue; } } /* * Either it was a leaf node or reconnect failed and it * became a leaf node. Continue reconnecting starting with * the next sibling of the parent node. */ while (s != root) { struct serio *parent = s->parent; if (!list_is_last(&s->child_node, &parent->children)) { s = list_entry(s->child_node.next, struct serio, child_node); break; } s = parent; } } while (s != root); } /* * serio_disconnect_port() unbinds a port from its driver. As a side effect * all children ports are unbound and destroyed. */ static void serio_disconnect_port(struct serio *serio) { struct serio *s = serio; /* * Children ports should be disconnected and destroyed * first; we travel the tree in depth-first order. */ while (!list_empty(&serio->children)) { /* Locate a leaf */ while (!list_empty(&s->children)) s = list_first_entry(&s->children, struct serio, child_node); /* * Prune this leaf node unless it is the one we * started with. */ if (s != serio) { struct serio *parent = s->parent; device_release_driver(&s->dev); serio_destroy_port(s); s = parent; } } /* * OK, no children left, now disconnect this port. */ device_release_driver(&serio->dev); } void serio_rescan(struct serio *serio) { serio_queue_event(serio, NULL, SERIO_RESCAN_PORT); } EXPORT_SYMBOL(serio_rescan); void serio_reconnect(struct serio *serio) { serio_queue_event(serio, NULL, SERIO_RECONNECT_SUBTREE); } EXPORT_SYMBOL(serio_reconnect); /* * Submits register request to kseriod for subsequent execution. * Note that port registration is always asynchronous. */ void __serio_register_port(struct serio *serio, struct module *owner) { serio_init_port(serio); serio_queue_event(serio, owner, SERIO_REGISTER_PORT); } EXPORT_SYMBOL(__serio_register_port); /* * Synchronously unregisters serio port. */ void serio_unregister_port(struct serio *serio) { guard(mutex)(&serio_mutex); serio_disconnect_port(serio); serio_destroy_port(serio); } EXPORT_SYMBOL(serio_unregister_port); /* * Safely unregisters children ports if they are present. */ void serio_unregister_child_port(struct serio *serio) { struct serio *s, *next; guard(mutex)(&serio_mutex); list_for_each_entry_safe(s, next, &serio->children, child_node) { serio_disconnect_port(s); serio_destroy_port(s); } } EXPORT_SYMBOL(serio_unregister_child_port); /* * Serio driver operations */ static ssize_t description_show(struct device_driver *drv, char *buf) { struct serio_driver *driver = to_serio_driver(drv); return sprintf(buf, "%s\n", driver->description ? driver->description : "(none)"); } static DRIVER_ATTR_RO(description); static ssize_t bind_mode_show(struct device_driver *drv, char *buf) { struct serio_driver *serio_drv = to_serio_driver(drv); return sprintf(buf, "%s\n", serio_drv->manual_bind ? "manual" : "auto"); } static ssize_t bind_mode_store(struct device_driver *drv, const char *buf, size_t count) { struct serio_driver *serio_drv = to_serio_driver(drv); int retval; retval = count; if (!strncmp(buf, "manual", count)) { serio_drv->manual_bind = true; } else if (!strncmp(buf, "auto", count)) { serio_drv->manual_bind = false; } else { retval = -EINVAL; } return retval; } static DRIVER_ATTR_RW(bind_mode); static struct attribute *serio_driver_attrs[] = { &driver_attr_description.attr, &driver_attr_bind_mode.attr, NULL, }; ATTRIBUTE_GROUPS(serio_driver); static int serio_driver_probe(struct device *dev) { struct serio *serio = to_serio_port(dev); struct serio_driver *drv = to_serio_driver(dev->driver); return serio_connect_driver(serio, drv); } static void serio_driver_remove(struct device *dev) { struct serio *serio = to_serio_port(dev); serio_disconnect_driver(serio); } static void serio_cleanup(struct serio *serio) { guard(mutex)(&serio->drv_mutex); if (serio->drv && serio->drv->cleanup) serio->drv->cleanup(serio); } static void serio_shutdown(struct device *dev) { struct serio *serio = to_serio_port(dev); serio_cleanup(serio); } static void serio_attach_driver(struct serio_driver *drv) { int error; error = driver_attach(&drv->driver); if (error) pr_warn("driver_attach() failed for %s with error %d\n", drv->driver.name, error); } int __serio_register_driver(struct serio_driver *drv, struct module *owner, const char *mod_name) { bool manual_bind = drv->manual_bind; int error; drv->driver.bus = &serio_bus; drv->driver.owner = owner; drv->driver.mod_name = mod_name; /* * Temporarily disable automatic binding because probing * takes long time and we are better off doing it in kseriod */ drv->manual_bind = true; error = driver_register(&drv->driver); if (error) { pr_err("driver_register() failed for %s, error: %d\n", drv->driver.name, error); return error; } /* * Restore original bind mode and let kseriod bind the * driver to free ports */ if (!manual_bind) { drv->manual_bind = false; error = serio_queue_event(drv, NULL, SERIO_ATTACH_DRIVER); if (error) { driver_unregister(&drv->driver); return error; } } return 0; } EXPORT_SYMBOL(__serio_register_driver); void serio_unregister_driver(struct serio_driver *drv) { struct serio *serio; guard(mutex)(&serio_mutex); drv->manual_bind = true; /* so serio_find_driver ignores it */ serio_remove_pending_events(drv); start_over: list_for_each_entry(serio, &serio_list, node) { if (serio->drv == drv) { serio_disconnect_port(serio); serio_find_driver(serio); /* we could've deleted some ports, restart */ goto start_over; } } driver_unregister(&drv->driver); } EXPORT_SYMBOL(serio_unregister_driver); static void serio_set_drv(struct serio *serio, struct serio_driver *drv) { guard(serio_pause_rx)(serio); serio->drv = drv; } static int serio_bus_match(struct device *dev, const struct device_driver *drv) { struct serio *serio = to_serio_port(dev); const struct serio_driver *serio_drv = to_serio_driver(drv); if (serio->manual_bind || serio_drv->manual_bind) return 0; return serio_match_port(serio_drv->id_table, serio); } #define SERIO_ADD_UEVENT_VAR(fmt, val...) \ do { \ int err = add_uevent_var(env, fmt, val); \ if (err) \ return err; \ } while (0) static int serio_uevent(const struct device *dev, struct kobj_uevent_env *env) { const struct serio *serio; if (!dev) return -ENODEV; serio = to_serio_port(dev); SERIO_ADD_UEVENT_VAR("SERIO_TYPE=%02x", serio->id.type); SERIO_ADD_UEVENT_VAR("SERIO_PROTO=%02x", serio->id.proto); SERIO_ADD_UEVENT_VAR("SERIO_ID=%02x", serio->id.id); SERIO_ADD_UEVENT_VAR("SERIO_EXTRA=%02x", serio->id.extra); SERIO_ADD_UEVENT_VAR("MODALIAS=serio:ty%02Xpr%02Xid%02Xex%02X", serio->id.type, serio->id.proto, serio->id.id, serio->id.extra); if (serio->firmware_id[0]) SERIO_ADD_UEVENT_VAR("SERIO_FIRMWARE_ID=%s", serio->firmware_id); return 0; } #undef SERIO_ADD_UEVENT_VAR #ifdef CONFIG_PM static int serio_suspend(struct device *dev) { struct serio *serio = to_serio_port(dev); serio_cleanup(serio); return 0; } static int serio_resume(struct device *dev) { struct serio *serio = to_serio_port(dev); int error = -ENOENT; scoped_guard(mutex, &serio->drv_mutex) { if (serio->drv && serio->drv->fast_reconnect) { error = serio->drv->fast_reconnect(serio); if (error && error != -ENOENT) dev_warn(dev, "fast reconnect failed with error %d\n", error); } } if (error) { /* * Driver reconnect can take a while, so better let * kseriod deal with it. */ serio_queue_event(serio, NULL, SERIO_RECONNECT_PORT); } return 0; } static const struct dev_pm_ops serio_pm_ops = { .suspend = serio_suspend, .resume = serio_resume, .poweroff = serio_suspend, .restore = serio_resume, }; #endif /* CONFIG_PM */ /* called from serio_driver->connect/disconnect methods under serio_mutex */ int serio_open(struct serio *serio, struct serio_driver *drv) { serio_set_drv(serio, drv); if (serio->open && serio->open(serio)) { serio_set_drv(serio, NULL); return -1; } return 0; } EXPORT_SYMBOL(serio_open); /* called from serio_driver->connect/disconnect methods under serio_mutex */ void serio_close(struct serio *serio) { if (serio->close) serio->close(serio); serio_set_drv(serio, NULL); } EXPORT_SYMBOL(serio_close); irqreturn_t serio_interrupt(struct serio *serio, unsigned char data, unsigned int dfl) { guard(spinlock_irqsave)(&serio->lock); if (likely(serio->drv)) return serio->drv->interrupt(serio, data, dfl); if (!dfl && device_is_registered(&serio->dev)) { serio_rescan(serio); return IRQ_HANDLED; } return IRQ_NONE; } EXPORT_SYMBOL(serio_interrupt); const struct bus_type serio_bus = { .name = "serio", .drv_groups = serio_driver_groups, .match = serio_bus_match, .uevent = serio_uevent, .probe = serio_driver_probe, .remove = serio_driver_remove, .shutdown = serio_shutdown, #ifdef CONFIG_PM .pm = &serio_pm_ops, #endif }; EXPORT_SYMBOL(serio_bus); static int __init serio_init(void) { int error; error = bus_register(&serio_bus); if (error) { pr_err("Failed to register serio bus, error: %d\n", error); return error; } return 0; } static void __exit serio_exit(void) { bus_unregister(&serio_bus); /* * There should not be any outstanding events but work may * still be scheduled so simply cancel it. */ cancel_work_sync(&serio_event_work); } subsys_initcall(serio_init); module_exit(serio_exit); |
| 717 551 231 232 276 277 174 174 322 322 338 340 268 268 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (C) 2016 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/cache.h> #include <linux/random.h> #include <linux/hrtimer.h> #include <linux/ktime.h> #include <linux/string.h> #include <linux/net.h> #include <linux/siphash.h> #include <net/secure_seq.h> #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET) #include <linux/in6.h> #include <net/tcp.h> static siphash_aligned_key_t net_secret; static siphash_aligned_key_t ts_secret; #define EPHEMERAL_PORT_SHUFFLE_PERIOD (10 * HZ) static __always_inline void net_secret_init(void) { net_get_random_once(&net_secret, sizeof(net_secret)); } static __always_inline void ts_secret_init(void) { net_get_random_once(&ts_secret, sizeof(ts_secret)); } #endif #ifdef CONFIG_INET static u32 seq_scale(u32 seq) { /* * As close as possible to RFC 793, which * suggests using a 250 kHz clock. * Further reading shows this assumes 2 Mb/s networks. * For 10 Mb/s Ethernet, a 1 MHz clock is appropriate. * For 10 Gb/s Ethernet, a 1 GHz clock should be ok, but * we also need to limit the resolution so that the u32 seq * overlaps less than one time per MSL (2 minutes). * Choosing a clock of 64 ns period is OK. (period of 274 s) */ return seq + (ktime_get_real_ns() >> 6); } #endif #if IS_ENABLED(CONFIG_IPV6) u32 secure_tcpv6_ts_off(const struct net *net, const __be32 *saddr, const __be32 *daddr) { const struct { struct in6_addr saddr; struct in6_addr daddr; } __aligned(SIPHASH_ALIGNMENT) combined = { .saddr = *(struct in6_addr *)saddr, .daddr = *(struct in6_addr *)daddr, }; if (READ_ONCE(net->ipv4.sysctl_tcp_timestamps) != 1) return 0; ts_secret_init(); return siphash(&combined, offsetofend(typeof(combined), daddr), &ts_secret); } EXPORT_IPV6_MOD(secure_tcpv6_ts_off); u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr, __be16 sport, __be16 dport) { const struct { struct in6_addr saddr; struct in6_addr daddr; __be16 sport; __be16 dport; } __aligned(SIPHASH_ALIGNMENT) combined = { .saddr = *(struct in6_addr *)saddr, .daddr = *(struct in6_addr *)daddr, .sport = sport, .dport = dport }; u32 hash; net_secret_init(); hash = siphash(&combined, offsetofend(typeof(combined), dport), &net_secret); return seq_scale(hash); } EXPORT_SYMBOL(secure_tcpv6_seq); u64 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, __be16 dport) { const struct { struct in6_addr saddr; struct in6_addr daddr; unsigned int timeseed; __be16 dport; } __aligned(SIPHASH_ALIGNMENT) combined = { .saddr = *(struct in6_addr *)saddr, .daddr = *(struct in6_addr *)daddr, .timeseed = jiffies / EPHEMERAL_PORT_SHUFFLE_PERIOD, .dport = dport, }; net_secret_init(); return siphash(&combined, offsetofend(typeof(combined), dport), &net_secret); } EXPORT_SYMBOL(secure_ipv6_port_ephemeral); #endif #ifdef CONFIG_INET u32 secure_tcp_ts_off(const struct net *net, __be32 saddr, __be32 daddr) { if (READ_ONCE(net->ipv4.sysctl_tcp_timestamps) != 1) return 0; ts_secret_init(); return siphash_2u32((__force u32)saddr, (__force u32)daddr, &ts_secret); } /* secure_tcp_seq_and_tsoff(a, b, 0, d) == secure_ipv4_port_ephemeral(a, b, d), * but fortunately, `sport' cannot be 0 in any circumstances. If this changes, * it would be easy enough to have the former function use siphash_4u32, passing * the arguments as separate u32. */ u32 secure_tcp_seq(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport) { u32 hash; net_secret_init(); hash = siphash_3u32((__force u32)saddr, (__force u32)daddr, (__force u32)sport << 16 | (__force u32)dport, &net_secret); return seq_scale(hash); } EXPORT_SYMBOL_GPL(secure_tcp_seq); u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport) { net_secret_init(); return siphash_4u32((__force u32)saddr, (__force u32)daddr, (__force u16)dport, jiffies / EPHEMERAL_PORT_SHUFFLE_PERIOD, &net_secret); } EXPORT_SYMBOL_GPL(secure_ipv4_port_ephemeral); #endif |
| 22 6 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_MSR_H #define _ASM_X86_MSR_H #include "msr-index.h" #ifndef __ASSEMBLER__ #include <asm/asm.h> #include <asm/errno.h> #include <asm/cpumask.h> #include <uapi/asm/msr.h> #include <asm/shared/msr.h> #include <linux/types.h> #include <linux/percpu.h> struct msr_info { u32 msr_no; struct msr reg; struct msr __percpu *msrs; int err; }; struct msr_regs_info { u32 *regs; int err; }; struct saved_msr { bool valid; struct msr_info info; }; struct saved_msrs { unsigned int num; struct saved_msr *array; }; /* * Be very careful with includes. This header is prone to include loops. */ #include <asm/atomic.h> #include <linux/tracepoint-defs.h> #ifdef CONFIG_TRACEPOINTS DECLARE_TRACEPOINT(read_msr); DECLARE_TRACEPOINT(write_msr); DECLARE_TRACEPOINT(rdpmc); extern void do_trace_write_msr(u32 msr, u64 val, int failed); extern void do_trace_read_msr(u32 msr, u64 val, int failed); extern void do_trace_rdpmc(u32 msr, u64 val, int failed); #else static inline void do_trace_write_msr(u32 msr, u64 val, int failed) {} static inline void do_trace_read_msr(u32 msr, u64 val, int failed) {} static inline void do_trace_rdpmc(u32 msr, u64 val, int failed) {} #endif /* * __rdmsr() and __wrmsr() are the two primitives which are the bare minimum MSR * accessors and should not have any tracing or other functionality piggybacking * on them - those are *purely* for accessing MSRs and nothing more. So don't even * think of extending them - you will be slapped with a stinking trout or a frozen * shark will reach you, wherever you are! You've been warned. */ static __always_inline u64 __rdmsr(u32 msr) { EAX_EDX_DECLARE_ARGS(val, low, high); asm volatile("1: rdmsr\n" "2:\n" _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_RDMSR) : EAX_EDX_RET(val, low, high) : "c" (msr)); return EAX_EDX_VAL(val, low, high); } static __always_inline void __wrmsrq(u32 msr, u64 val) { asm volatile("1: wrmsr\n" "2:\n" _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32)) : "memory"); } #define native_rdmsr(msr, val1, val2) \ do { \ u64 __val = __rdmsr((msr)); \ (void)((val1) = (u32)__val); \ (void)((val2) = (u32)(__val >> 32)); \ } while (0) static __always_inline u64 native_rdmsrq(u32 msr) { return __rdmsr(msr); } #define native_wrmsr(msr, low, high) \ __wrmsrq((msr), (u64)(high) << 32 | (low)) #define native_wrmsrq(msr, val) \ __wrmsrq((msr), (val)) static inline u64 native_read_msr(u32 msr) { u64 val; val = __rdmsr(msr); if (tracepoint_enabled(read_msr)) do_trace_read_msr(msr, val, 0); return val; } static inline int native_read_msr_safe(u32 msr, u64 *p) { int err; EAX_EDX_DECLARE_ARGS(val, low, high); asm volatile("1: rdmsr ; xor %[err],%[err]\n" "2:\n\t" _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_RDMSR_SAFE, %[err]) : [err] "=r" (err), EAX_EDX_RET(val, low, high) : "c" (msr)); if (tracepoint_enabled(read_msr)) do_trace_read_msr(msr, EAX_EDX_VAL(val, low, high), err); *p = EAX_EDX_VAL(val, low, high); return err; } /* Can be uninlined because referenced by paravirt */ static inline void notrace native_write_msr(u32 msr, u64 val) { native_wrmsrq(msr, val); if (tracepoint_enabled(write_msr)) do_trace_write_msr(msr, val, 0); } /* Can be uninlined because referenced by paravirt */ static inline int notrace native_write_msr_safe(u32 msr, u64 val) { int err; asm volatile("1: wrmsr ; xor %[err],%[err]\n" "2:\n\t" _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_WRMSR_SAFE, %[err]) : [err] "=a" (err) : "c" (msr), "0" ((u32)val), "d" ((u32)(val >> 32)) : "memory"); if (tracepoint_enabled(write_msr)) do_trace_write_msr(msr, val, err); return err; } extern int rdmsr_safe_regs(u32 regs[8]); extern int wrmsr_safe_regs(u32 regs[8]); static inline u64 native_read_pmc(int counter) { EAX_EDX_DECLARE_ARGS(val, low, high); asm volatile("rdpmc" : EAX_EDX_RET(val, low, high) : "c" (counter)); if (tracepoint_enabled(rdpmc)) do_trace_rdpmc(counter, EAX_EDX_VAL(val, low, high), 0); return EAX_EDX_VAL(val, low, high); } #ifdef CONFIG_PARAVIRT_XXL #include <asm/paravirt.h> #else #include <linux/errno.h> /* * Access to machine-specific registers (available on 586 and better only) * Note: the rd* operations modify the parameters directly (without using * pointer indirection), this allows gcc to optimize better */ #define rdmsr(msr, low, high) \ do { \ u64 __val = native_read_msr((msr)); \ (void)((low) = (u32)__val); \ (void)((high) = (u32)(__val >> 32)); \ } while (0) static inline void wrmsr(u32 msr, u32 low, u32 high) { native_write_msr(msr, (u64)high << 32 | low); } #define rdmsrq(msr, val) \ ((val) = native_read_msr((msr))) static inline void wrmsrq(u32 msr, u64 val) { native_write_msr(msr, val); } /* wrmsr with exception handling */ static inline int wrmsrq_safe(u32 msr, u64 val) { return native_write_msr_safe(msr, val); } /* rdmsr with exception handling */ #define rdmsr_safe(msr, low, high) \ ({ \ u64 __val; \ int __err = native_read_msr_safe((msr), &__val); \ (*low) = (u32)__val; \ (*high) = (u32)(__val >> 32); \ __err; \ }) static inline int rdmsrq_safe(u32 msr, u64 *p) { return native_read_msr_safe(msr, p); } static __always_inline u64 rdpmc(int counter) { return native_read_pmc(counter); } #endif /* !CONFIG_PARAVIRT_XXL */ /* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ #define ASM_WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) /* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ static __always_inline void wrmsrns(u32 msr, u64 val) { /* * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant * DS prefix to avoid a trailing NOP. */ asm volatile("1: " ALTERNATIVE("ds wrmsr", ASM_WRMSRNS, X86_FEATURE_WRMSRNS) "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))); } /* * Dual u32 version of wrmsrq_safe(): */ static inline int wrmsr_safe(u32 msr, u32 low, u32 high) { return wrmsrq_safe(msr, (u64)high << 32 | low); } struct msr __percpu *msrs_alloc(void); void msrs_free(struct msr __percpu *msrs); int msr_set_bit(u32 msr, u8 bit); int msr_clear_bit(u32 msr, u8 bit); #ifdef CONFIG_SMP int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h); int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h); int rdmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 *q); int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q); void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs); void wrmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs); int rdmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h); int wrmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h); int rdmsrq_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 *q); int wrmsrq_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 q); int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]); int wrmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]); #else /* CONFIG_SMP */ static inline int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h) { rdmsr(msr_no, *l, *h); return 0; } static inline int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h) { wrmsr(msr_no, l, h); return 0; } static inline int rdmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 *q) { rdmsrq(msr_no, *q); return 0; } static inline int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q) { wrmsrq(msr_no, q); return 0; } static inline void rdmsr_on_cpus(const struct cpumask *m, u32 msr_no, struct msr __percpu *msrs) { rdmsr_on_cpu(0, msr_no, raw_cpu_ptr(&msrs->l), raw_cpu_ptr(&msrs->h)); } static inline void wrmsr_on_cpus(const struct cpumask *m, u32 msr_no, struct msr __percpu *msrs) { wrmsr_on_cpu(0, msr_no, raw_cpu_read(msrs->l), raw_cpu_read(msrs->h)); } static inline int rdmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h) { return rdmsr_safe(msr_no, l, h); } static inline int wrmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h) { return wrmsr_safe(msr_no, l, h); } static inline int rdmsrq_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 *q) { return rdmsrq_safe(msr_no, q); } static inline int wrmsrq_safe_on_cpu(unsigned int cpu, u32 msr_no, u64 q) { return wrmsrq_safe(msr_no, q); } static inline int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]) { return rdmsr_safe_regs(regs); } static inline int wrmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]) { return wrmsr_safe_regs(regs); } #endif /* CONFIG_SMP */ /* Compatibility wrappers: */ #define rdmsrl(msr, val) rdmsrq(msr, val) #define wrmsrl(msr, val) wrmsrq(msr, val) #define rdmsrl_on_cpu(cpu, msr, q) rdmsrq_on_cpu(cpu, msr, q) #endif /* __ASSEMBLER__ */ #endif /* _ASM_X86_MSR_H */ |
| 8 8 2 2 2 2 2 2 6 6 8 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 | // SPDX-License-Identifier: GPL-2.0-or-later /* drivers/net/ifb.c: The purpose of this driver is to provide a device that allows for sharing of resources: 1) qdiscs/policies that are per device as opposed to system wide. ifb allows for a device which can be redirected to thus providing an impression of sharing. 2) Allows for queueing incoming traffic for shaping instead of dropping. The original concept is based on what is known as the IMQ driver initially written by Martin Devera, later rewritten by Patrick McHardy and then maintained by Andre Correa. You need the tc action mirror or redirect to feed this device packets. Authors: Jamal Hadi Salim (2005) */ #include <linux/module.h> #include <linux/kernel.h> #include <linux/netdevice.h> #include <linux/ethtool.h> #include <linux/etherdevice.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/moduleparam.h> #include <linux/netfilter_netdev.h> #include <net/pkt_sched.h> #include <net/net_namespace.h> #define TX_Q_LIMIT 32 struct ifb_q_stats { u64 packets; u64 bytes; struct u64_stats_sync sync; }; struct ifb_q_private { struct net_device *dev; struct tasklet_struct ifb_tasklet; int tasklet_pending; int txqnum; struct sk_buff_head rq; struct sk_buff_head tq; struct ifb_q_stats rx_stats; struct ifb_q_stats tx_stats; } ____cacheline_aligned_in_smp; struct ifb_dev_private { struct ifb_q_private *tx_private; }; /* For ethtools stats. */ struct ifb_q_stats_desc { char desc[ETH_GSTRING_LEN]; size_t offset; }; #define IFB_Q_STAT(m) offsetof(struct ifb_q_stats, m) static const struct ifb_q_stats_desc ifb_q_stats_desc[] = { { "packets", IFB_Q_STAT(packets) }, { "bytes", IFB_Q_STAT(bytes) }, }; #define IFB_Q_STATS_LEN ARRAY_SIZE(ifb_q_stats_desc) static netdev_tx_t ifb_xmit(struct sk_buff *skb, struct net_device *dev); static int ifb_open(struct net_device *dev); static int ifb_close(struct net_device *dev); static void ifb_update_q_stats(struct ifb_q_stats *stats, int len) { u64_stats_update_begin(&stats->sync); stats->packets++; stats->bytes += len; u64_stats_update_end(&stats->sync); } static void ifb_ri_tasklet(struct tasklet_struct *t) { struct ifb_q_private *txp = from_tasklet(txp, t, ifb_tasklet); struct netdev_queue *txq; struct sk_buff *skb; txq = netdev_get_tx_queue(txp->dev, txp->txqnum); skb = skb_peek(&txp->tq); if (!skb) { if (!__netif_tx_trylock(txq)) goto resched; skb_queue_splice_tail_init(&txp->rq, &txp->tq); __netif_tx_unlock(txq); } while ((skb = __skb_dequeue(&txp->tq)) != NULL) { /* Skip tc and netfilter to prevent redirection loop. */ skb->redirected = 0; #ifdef CONFIG_NET_CLS_ACT skb->tc_skip_classify = 1; #endif nf_skip_egress(skb, true); ifb_update_q_stats(&txp->tx_stats, skb->len); rcu_read_lock(); skb->dev = dev_get_by_index_rcu(dev_net(txp->dev), skb->skb_iif); if (!skb->dev) { rcu_read_unlock(); dev_kfree_skb(skb); txp->dev->stats.tx_dropped++; if (skb_queue_len(&txp->tq) != 0) goto resched; break; } rcu_read_unlock(); skb->skb_iif = txp->dev->ifindex; if (!skb->from_ingress) { dev_queue_xmit(skb); } else { skb_pull_rcsum(skb, skb->mac_len); netif_receive_skb(skb); } } if (__netif_tx_trylock(txq)) { skb = skb_peek(&txp->rq); if (!skb) { txp->tasklet_pending = 0; if (netif_tx_queue_stopped(txq)) netif_tx_wake_queue(txq); } else { __netif_tx_unlock(txq); goto resched; } __netif_tx_unlock(txq); } else { resched: txp->tasklet_pending = 1; tasklet_schedule(&txp->ifb_tasklet); } } static void ifb_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats) { struct ifb_dev_private *dp = netdev_priv(dev); struct ifb_q_private *txp = dp->tx_private; unsigned int start; u64 packets, bytes; int i; for (i = 0; i < dev->num_tx_queues; i++,txp++) { do { start = u64_stats_fetch_begin(&txp->rx_stats.sync); packets = txp->rx_stats.packets; bytes = txp->rx_stats.bytes; } while (u64_stats_fetch_retry(&txp->rx_stats.sync, start)); stats->rx_packets += packets; stats->rx_bytes += bytes; do { start = u64_stats_fetch_begin(&txp->tx_stats.sync); packets = txp->tx_stats.packets; bytes = txp->tx_stats.bytes; } while (u64_stats_fetch_retry(&txp->tx_stats.sync, start)); stats->tx_packets += packets; stats->tx_bytes += bytes; } stats->rx_dropped = dev->stats.rx_dropped; stats->tx_dropped = dev->stats.tx_dropped; } static int ifb_dev_init(struct net_device *dev) { struct ifb_dev_private *dp = netdev_priv(dev); struct ifb_q_private *txp; int i; txp = kcalloc(dev->num_tx_queues, sizeof(*txp), GFP_KERNEL); if (!txp) return -ENOMEM; dp->tx_private = txp; for (i = 0; i < dev->num_tx_queues; i++,txp++) { txp->txqnum = i; txp->dev = dev; __skb_queue_head_init(&txp->rq); __skb_queue_head_init(&txp->tq); u64_stats_init(&txp->rx_stats.sync); u64_stats_init(&txp->tx_stats.sync); tasklet_setup(&txp->ifb_tasklet, ifb_ri_tasklet); netif_tx_start_queue(netdev_get_tx_queue(dev, i)); } return 0; } static void ifb_get_strings(struct net_device *dev, u32 stringset, u8 *buf) { u8 *p = buf; int i, j; switch (stringset) { case ETH_SS_STATS: for (i = 0; i < dev->real_num_rx_queues; i++) for (j = 0; j < IFB_Q_STATS_LEN; j++) ethtool_sprintf(&p, "rx_queue_%u_%.18s", i, ifb_q_stats_desc[j].desc); for (i = 0; i < dev->real_num_tx_queues; i++) for (j = 0; j < IFB_Q_STATS_LEN; j++) ethtool_sprintf(&p, "tx_queue_%u_%.18s", i, ifb_q_stats_desc[j].desc); break; } } static int ifb_get_sset_count(struct net_device *dev, int sset) { switch (sset) { case ETH_SS_STATS: return IFB_Q_STATS_LEN * (dev->real_num_rx_queues + dev->real_num_tx_queues); default: return -EOPNOTSUPP; } } static void ifb_fill_stats_data(u64 **data, struct ifb_q_stats *q_stats) { void *stats_base = (void *)q_stats; unsigned int start; size_t offset; int j; do { start = u64_stats_fetch_begin(&q_stats->sync); for (j = 0; j < IFB_Q_STATS_LEN; j++) { offset = ifb_q_stats_desc[j].offset; (*data)[j] = *(u64 *)(stats_base + offset); } } while (u64_stats_fetch_retry(&q_stats->sync, start)); *data += IFB_Q_STATS_LEN; } static void ifb_get_ethtool_stats(struct net_device *dev, struct ethtool_stats *stats, u64 *data) { struct ifb_dev_private *dp = netdev_priv(dev); struct ifb_q_private *txp; int i; for (i = 0; i < dev->real_num_rx_queues; i++) { txp = dp->tx_private + i; ifb_fill_stats_data(&data, &txp->rx_stats); } for (i = 0; i < dev->real_num_tx_queues; i++) { txp = dp->tx_private + i; ifb_fill_stats_data(&data, &txp->tx_stats); } } static const struct net_device_ops ifb_netdev_ops = { .ndo_open = ifb_open, .ndo_stop = ifb_close, .ndo_get_stats64 = ifb_stats64, .ndo_start_xmit = ifb_xmit, .ndo_validate_addr = eth_validate_addr, .ndo_init = ifb_dev_init, }; static const struct ethtool_ops ifb_ethtool_ops = { .get_strings = ifb_get_strings, .get_sset_count = ifb_get_sset_count, .get_ethtool_stats = ifb_get_ethtool_stats, }; #define IFB_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_FRAGLIST | \ NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL | \ NETIF_F_HIGHDMA | NETIF_F_HW_VLAN_CTAG_TX | \ NETIF_F_HW_VLAN_STAG_TX) static void ifb_dev_free(struct net_device *dev) { struct ifb_dev_private *dp = netdev_priv(dev); struct ifb_q_private *txp = dp->tx_private; int i; for (i = 0; i < dev->num_tx_queues; i++,txp++) { tasklet_kill(&txp->ifb_tasklet); __skb_queue_purge(&txp->rq); __skb_queue_purge(&txp->tq); } kfree(dp->tx_private); } static void ifb_setup(struct net_device *dev) { /* Initialize the device structure. */ dev->netdev_ops = &ifb_netdev_ops; dev->ethtool_ops = &ifb_ethtool_ops; /* Fill in device structure with ethernet-generic values. */ ether_setup(dev); dev->tx_queue_len = TX_Q_LIMIT; dev->features |= IFB_FEATURES; dev->hw_features |= dev->features; dev->hw_enc_features |= dev->features; dev->vlan_features |= IFB_FEATURES & ~(NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX); dev->flags |= IFF_NOARP; dev->flags &= ~IFF_MULTICAST; dev->priv_flags &= ~IFF_TX_SKB_SHARING; netif_keep_dst(dev); eth_hw_addr_random(dev); dev->needs_free_netdev = true; dev->priv_destructor = ifb_dev_free; dev->min_mtu = 0; dev->max_mtu = 0; netif_set_tso_max_size(dev, GSO_MAX_SIZE); } static netdev_tx_t ifb_xmit(struct sk_buff *skb, struct net_device *dev) { struct ifb_dev_private *dp = netdev_priv(dev); struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb); ifb_update_q_stats(&txp->rx_stats, skb->len); if (!skb->redirected || !skb->skb_iif) { dev_kfree_skb(skb); dev->stats.rx_dropped++; return NETDEV_TX_OK; } if (skb_queue_len(&txp->rq) >= dev->tx_queue_len) netif_tx_stop_queue(netdev_get_tx_queue(dev, txp->txqnum)); __skb_queue_tail(&txp->rq, skb); if (!txp->tasklet_pending) { txp->tasklet_pending = 1; tasklet_schedule(&txp->ifb_tasklet); } return NETDEV_TX_OK; } static int ifb_close(struct net_device *dev) { netif_tx_stop_all_queues(dev); return 0; } static int ifb_open(struct net_device *dev) { netif_tx_start_all_queues(dev); return 0; } static int ifb_validate(struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { if (tb[IFLA_ADDRESS]) { if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) return -EINVAL; if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) return -EADDRNOTAVAIL; } return 0; } static struct rtnl_link_ops ifb_link_ops __read_mostly = { .kind = "ifb", .priv_size = sizeof(struct ifb_dev_private), .setup = ifb_setup, .validate = ifb_validate, }; /* Number of ifb devices to be set up by this module. * Note that these legacy devices have one queue. * Prefer something like : ip link add ifb10 numtxqueues 8 type ifb */ static int numifbs = 2; module_param(numifbs, int, 0); MODULE_PARM_DESC(numifbs, "Number of ifb devices"); static int __init ifb_init_one(int index) { struct net_device *dev_ifb; int err; dev_ifb = alloc_netdev(sizeof(struct ifb_dev_private), "ifb%d", NET_NAME_UNKNOWN, ifb_setup); if (!dev_ifb) return -ENOMEM; dev_ifb->rtnl_link_ops = &ifb_link_ops; err = register_netdevice(dev_ifb); if (err < 0) goto err; return 0; err: free_netdev(dev_ifb); return err; } static int __init ifb_init_module(void) { int i, err; err = rtnl_link_register(&ifb_link_ops); if (err < 0) return err; rtnl_net_lock(&init_net); for (i = 0; i < numifbs && !err; i++) { err = ifb_init_one(i); cond_resched(); } rtnl_net_unlock(&init_net); if (err) rtnl_link_unregister(&ifb_link_ops); return err; } static void __exit ifb_cleanup_module(void) { rtnl_link_unregister(&ifb_link_ops); } module_init(ifb_init_module); module_exit(ifb_cleanup_module); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("Intermediate Functional Block (ifb) netdevice driver for sharing of resources and ingress packet queuing"); MODULE_AUTHOR("Jamal Hadi Salim"); MODULE_ALIAS_RTNL_LINK("ifb"); |
| 1109 1109 1158 1109 1125 35 58 58 58 58 1222 1164 1027 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 | // SPDX-License-Identifier: GPL-2.0-only /* * 9P entry point * * Copyright (C) 2007 by Latchesar Ionkov <lucho@ionkov.net> * Copyright (C) 2004 by Eric Van Hensbergen <ericvh@gmail.com> * Copyright (C) 2002 by Ron Minnich <rminnich@lanl.gov> */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/module.h> #include <linux/kmod.h> #include <linux/errno.h> #include <linux/sched.h> #include <linux/moduleparam.h> #include <net/9p/9p.h> #include <linux/fs.h> #include <linux/parser.h> #include <net/9p/client.h> #include <net/9p/transport.h> #include <linux/list.h> #include <linux/spinlock.h> #ifdef CONFIG_NET_9P_DEBUG unsigned int p9_debug_level; /* feature-rific global debug level */ EXPORT_SYMBOL(p9_debug_level); module_param_named(debug, p9_debug_level, uint, 0); MODULE_PARM_DESC(debug, "9P debugging level"); void _p9_debug(enum p9_debug_flags level, const char *func, const char *fmt, ...) { struct va_format vaf; va_list args; if ((p9_debug_level & level) != level) return; va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; if (level == P9_DEBUG_9P) pr_notice("(%8.8d) %pV", task_pid_nr(current), &vaf); else pr_notice("-- %s (%d): %pV", func, task_pid_nr(current), &vaf); va_end(args); } EXPORT_SYMBOL(_p9_debug); #endif /* Dynamic Transport Registration Routines */ static DEFINE_SPINLOCK(v9fs_trans_lock); static LIST_HEAD(v9fs_trans_list); /** * v9fs_register_trans - register a new transport with 9p * @m: structure describing the transport module and entry points * */ void v9fs_register_trans(struct p9_trans_module *m) { spin_lock(&v9fs_trans_lock); list_add_tail(&m->list, &v9fs_trans_list); spin_unlock(&v9fs_trans_lock); } EXPORT_SYMBOL(v9fs_register_trans); /** * v9fs_unregister_trans - unregister a 9p transport * @m: the transport to remove * */ void v9fs_unregister_trans(struct p9_trans_module *m) { spin_lock(&v9fs_trans_lock); list_del_init(&m->list); spin_unlock(&v9fs_trans_lock); } EXPORT_SYMBOL(v9fs_unregister_trans); static struct p9_trans_module *_p9_get_trans_by_name(const char *s) { struct p9_trans_module *t, *found = NULL; spin_lock(&v9fs_trans_lock); list_for_each_entry(t, &v9fs_trans_list, list) if (strcmp(t->name, s) == 0 && try_module_get(t->owner)) { found = t; break; } spin_unlock(&v9fs_trans_lock); return found; } /** * v9fs_get_trans_by_name - get transport with the matching name * @s: string identifying transport * */ struct p9_trans_module *v9fs_get_trans_by_name(const char *s) { struct p9_trans_module *found = NULL; found = _p9_get_trans_by_name(s); #ifdef CONFIG_MODULES if (!found) { request_module("9p-%s", s); found = _p9_get_trans_by_name(s); } #endif return found; } EXPORT_SYMBOL(v9fs_get_trans_by_name); static const char * const v9fs_default_transports[] = { "virtio", "tcp", "fd", "unix", "xen", "rdma", }; /** * v9fs_get_default_trans - get the default transport * */ struct p9_trans_module *v9fs_get_default_trans(void) { struct p9_trans_module *t, *found = NULL; int i; spin_lock(&v9fs_trans_lock); list_for_each_entry(t, &v9fs_trans_list, list) if (t->def && try_module_get(t->owner)) { found = t; break; } if (!found) list_for_each_entry(t, &v9fs_trans_list, list) if (try_module_get(t->owner)) { found = t; break; } spin_unlock(&v9fs_trans_lock); for (i = 0; !found && i < ARRAY_SIZE(v9fs_default_transports); i++) found = v9fs_get_trans_by_name(v9fs_default_transports[i]); return found; } EXPORT_SYMBOL(v9fs_get_default_trans); /** * v9fs_put_trans - put trans * @m: transport to put * */ void v9fs_put_trans(struct p9_trans_module *m) { if (m) module_put(m->owner); } /** * init_p9 - Initialize module * */ static int __init init_p9(void) { int ret; ret = p9_client_init(); if (ret) return ret; p9_error_init(); pr_info("Installing 9P2000 support\n"); return ret; } /** * exit_p9 - shutdown module * */ static void __exit exit_p9(void) { pr_info("Unloading 9P2000 support\n"); p9_client_exit(); } module_init(init_p9) module_exit(exit_p9) MODULE_AUTHOR("Latchesar Ionkov <lucho@ionkov.net>"); MODULE_AUTHOR("Eric Van Hensbergen <ericvh@gmail.com>"); MODULE_AUTHOR("Ron Minnich <rminnich@lanl.gov>"); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("Plan 9 Resource Sharing Support (9P2000)"); |
| 113 113 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | // SPDX-License-Identifier: GPL-2.0-or-later /* mpihelp-add_1.c - MPI helper functions * Copyright (C) 1994, 1996, 1997, 1998, * 2000 Free Software Foundation, Inc. * * This file is part of GnuPG. * * Note: This code is heavily based on the GNU MP Library. * Actually it's the same code with only minor changes in the * way the data is stored; this is to support the abstraction * of an optional secure memory allocation which may be used * to avoid revealing of sensitive data due to paging etc. * The GNU MP Library itself is published under the LGPL; * however I decided to publish this code under the plain GPL. */ #include "mpi-internal.h" #include "longlong.h" mpi_limb_t mpihelp_add_n(mpi_ptr_t res_ptr, mpi_ptr_t s1_ptr, mpi_ptr_t s2_ptr, mpi_size_t size) { mpi_limb_t x, y, cy; mpi_size_t j; /* The loop counter and index J goes from -SIZE to -1. This way the loop becomes faster. */ j = -size; /* Offset the base pointers to compensate for the negative indices. */ s1_ptr -= j; s2_ptr -= j; res_ptr -= j; cy = 0; do { y = s2_ptr[j]; x = s1_ptr[j]; y += cy; /* add previous carry to one addend */ cy = y < cy; /* get out carry from that addition */ y += x; /* add other addend */ cy += y < x; /* get out carry from that add, combine */ res_ptr[j] = y; } while (++j); return cy; } |
| 2421 88 602 823 1125 1003 602 19505 1841 19505 24185 3573 19503 20566 1380 1380 20081 20077 24151 21403 10131 13448 18249 3339 1792 23274 21468 1874 21114 2529 2667 90 2084 18818 18842 36 36 18535 188 575 12856 7 1 34 12959 18818 18862 18818 262 11 174 324 720 1770 125 42 898 17238 8 303 1684 3397 3398 192 1463 8 2204 1216 10 348 3226 205 10 2669 1279 3402 13 192 396 2373 702 1232 88 88 162 163 162 2 163 75 163 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Resizable, Scalable, Concurrent Hash Table * * Copyright (c) 2015-2016 Herbert Xu <herbert@gondor.apana.org.au> * Copyright (c) 2014-2015 Thomas Graf <tgraf@suug.ch> * Copyright (c) 2008-2014 Patrick McHardy <kaber@trash.net> * * Code partially derived from nft_hash * Rewritten with rehash code from br_multicast plus single list * pointer as suggested by Josh Triplett * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. */ #ifndef _LINUX_RHASHTABLE_H #define _LINUX_RHASHTABLE_H #include <linux/err.h> #include <linux/errno.h> #include <linux/jhash.h> #include <linux/list_nulls.h> #include <linux/workqueue.h> #include <linux/rculist.h> #include <linux/bit_spinlock.h> #include <linux/rhashtable-types.h> /* * Objects in an rhashtable have an embedded struct rhash_head * which is linked into as hash chain from the hash table - or one * of two or more hash tables when the rhashtable is being resized. * The end of the chain is marked with a special nulls marks which has * the least significant bit set but otherwise stores the address of * the hash bucket. This allows us to be sure we've found the end * of the right list. * The value stored in the hash bucket has BIT(0) used as a lock bit. * This bit must be atomically set before any changes are made to * the chain. To avoid dereferencing this pointer without clearing * the bit first, we use an opaque 'struct rhash_lock_head *' for the * pointer stored in the bucket. This struct needs to be defined so * that rcu_dereference() works on it, but it has no content so a * cast is needed for it to be useful. This ensures it isn't * used by mistake with clearing the lock bit first. */ struct rhash_lock_head {}; /* Maximum chain length before rehash * * The maximum (not average) chain length grows with the size of the hash * table, at a rate of (log N)/(log log N). * * The value of 16 is selected so that even if the hash table grew to * 2^32 you would not expect the maximum chain length to exceed it * unless we are under attack (or extremely unlucky). * * As this limit is only to detect attacks, we don't need to set it to a * lower value as you'd need the chain length to vastly exceed 16 to have * any real effect on the system. */ #define RHT_ELASTICITY 16u /** * struct bucket_table - Table of hash buckets * @size: Number of hash buckets * @nest: Number of bits of first-level nested table. * @rehash: Current bucket being rehashed * @hash_rnd: Random seed to fold into hash * @walkers: List of active walkers * @rcu: RCU structure for freeing the table * @future_tbl: Table under construction during rehashing * @ntbl: Nested table used when out of memory. * @buckets: size * hash buckets */ struct bucket_table { unsigned int size; unsigned int nest; u32 hash_rnd; struct list_head walkers; struct rcu_head rcu; struct bucket_table __rcu *future_tbl; struct lockdep_map dep_map; struct rhash_lock_head __rcu *buckets[] ____cacheline_aligned_in_smp; }; /* * NULLS_MARKER() expects a hash value with the low * bits mostly likely to be significant, and it discards * the msb. * We give it an address, in which the bottom bit is * always 0, and the msb might be significant. * So we shift the address down one bit to align with * expectations and avoid losing a significant bit. * * We never store the NULLS_MARKER in the hash table * itself as we need the lsb for locking. * Instead we store a NULL */ #define RHT_NULLS_MARKER(ptr) \ ((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1)) #define INIT_RHT_NULLS_HEAD(ptr) \ ((ptr) = NULL) static inline bool rht_is_a_nulls(const struct rhash_head *ptr) { return ((unsigned long) ptr & 1); } static inline void *rht_obj(const struct rhashtable *ht, const struct rhash_head *he) { return (char *)he - ht->p.head_offset; } static inline unsigned int rht_bucket_index(const struct bucket_table *tbl, unsigned int hash) { return hash & (tbl->size - 1); } static inline unsigned int rht_key_get_hash(struct rhashtable *ht, const void *key, const struct rhashtable_params params, unsigned int hash_rnd) { unsigned int hash; /* params must be equal to ht->p if it isn't constant. */ if (!__builtin_constant_p(params.key_len)) hash = ht->p.hashfn(key, ht->key_len, hash_rnd); else if (params.key_len) { unsigned int key_len = params.key_len; if (params.hashfn) hash = params.hashfn(key, key_len, hash_rnd); else if (key_len & (sizeof(u32) - 1)) hash = jhash(key, key_len, hash_rnd); else hash = jhash2(key, key_len / sizeof(u32), hash_rnd); } else { unsigned int key_len = ht->p.key_len; if (params.hashfn) hash = params.hashfn(key, key_len, hash_rnd); else hash = jhash(key, key_len, hash_rnd); } return hash; } static inline unsigned int rht_key_hashfn( struct rhashtable *ht, const struct bucket_table *tbl, const void *key, const struct rhashtable_params params) { unsigned int hash = rht_key_get_hash(ht, key, params, tbl->hash_rnd); return rht_bucket_index(tbl, hash); } static inline unsigned int rht_head_hashfn( struct rhashtable *ht, const struct bucket_table *tbl, const struct rhash_head *he, const struct rhashtable_params params) { const char *ptr = rht_obj(ht, he); return likely(params.obj_hashfn) ? rht_bucket_index(tbl, params.obj_hashfn(ptr, params.key_len ?: ht->p.key_len, tbl->hash_rnd)) : rht_key_hashfn(ht, tbl, ptr + params.key_offset, params); } /** * rht_grow_above_75 - returns true if nelems > 0.75 * table-size * @ht: hash table * @tbl: current table */ static inline bool rht_grow_above_75(const struct rhashtable *ht, const struct bucket_table *tbl) { /* Expand table when exceeding 75% load */ return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) && (!ht->p.max_size || tbl->size < ht->p.max_size); } /** * rht_shrink_below_30 - returns true if nelems < 0.3 * table-size * @ht: hash table * @tbl: current table */ static inline bool rht_shrink_below_30(const struct rhashtable *ht, const struct bucket_table *tbl) { /* Shrink table beneath 30% load */ return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) && tbl->size > ht->p.min_size; } /** * rht_grow_above_100 - returns true if nelems > table-size * @ht: hash table * @tbl: current table */ static inline bool rht_grow_above_100(const struct rhashtable *ht, const struct bucket_table *tbl) { return atomic_read(&ht->nelems) > tbl->size && (!ht->p.max_size || tbl->size < ht->p.max_size); } /** * rht_grow_above_max - returns true if table is above maximum * @ht: hash table * @tbl: current table */ static inline bool rht_grow_above_max(const struct rhashtable *ht, const struct bucket_table *tbl) { return atomic_read(&ht->nelems) >= ht->max_elems; } #ifdef CONFIG_PROVE_LOCKING int lockdep_rht_mutex_is_held(struct rhashtable *ht); int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash); #else static inline int lockdep_rht_mutex_is_held(struct rhashtable *ht) { return 1; } static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash) { return 1; } #endif /* CONFIG_PROVE_LOCKING */ void *rhashtable_insert_slow(struct rhashtable *ht, const void *key, struct rhash_head *obj); void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter); void rhashtable_walk_exit(struct rhashtable_iter *iter); int rhashtable_walk_start_check(struct rhashtable_iter *iter) __acquires(RCU); static inline void rhashtable_walk_start(struct rhashtable_iter *iter) { (void)rhashtable_walk_start_check(iter); } void *rhashtable_walk_next(struct rhashtable_iter *iter); void *rhashtable_walk_peek(struct rhashtable_iter *iter); void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU); void rhashtable_free_and_destroy(struct rhashtable *ht, void (*free_fn)(void *ptr, void *arg), void *arg); void rhashtable_destroy(struct rhashtable *ht); struct rhash_lock_head __rcu **rht_bucket_nested( const struct bucket_table *tbl, unsigned int hash); struct rhash_lock_head __rcu **__rht_bucket_nested( const struct bucket_table *tbl, unsigned int hash); struct rhash_lock_head __rcu **rht_bucket_nested_insert( struct rhashtable *ht, struct bucket_table *tbl, unsigned int hash); #define rht_dereference(p, ht) \ rcu_dereference_protected(p, lockdep_rht_mutex_is_held(ht)) #define rht_dereference_rcu(p, ht) \ rcu_dereference_check(p, lockdep_rht_mutex_is_held(ht)) #define rht_dereference_bucket(p, tbl, hash) \ rcu_dereference_protected(p, lockdep_rht_bucket_is_held(tbl, hash)) #define rht_dereference_bucket_rcu(p, tbl, hash) \ rcu_dereference_check(p, lockdep_rht_bucket_is_held(tbl, hash)) #define rht_entry(tpos, pos, member) \ ({ tpos = container_of(pos, typeof(*tpos), member); 1; }) static inline struct rhash_lock_head __rcu *const *rht_bucket( const struct bucket_table *tbl, unsigned int hash) { return unlikely(tbl->nest) ? rht_bucket_nested(tbl, hash) : &tbl->buckets[hash]; } static inline struct rhash_lock_head __rcu **rht_bucket_var( struct bucket_table *tbl, unsigned int hash) { return unlikely(tbl->nest) ? __rht_bucket_nested(tbl, hash) : &tbl->buckets[hash]; } static inline struct rhash_lock_head __rcu **rht_bucket_insert( struct rhashtable *ht, struct bucket_table *tbl, unsigned int hash) { return unlikely(tbl->nest) ? rht_bucket_nested_insert(ht, tbl, hash) : &tbl->buckets[hash]; } /* * We lock a bucket by setting BIT(0) in the pointer - this is always * zero in real pointers. The NULLS mark is never stored in the bucket, * rather we store NULL if the bucket is empty. * bit_spin_locks do not handle contention well, but the whole point * of the hashtable design is to achieve minimum per-bucket contention. * A nested hash table might not have a bucket pointer. In that case * we cannot get a lock. For remove and replace the bucket cannot be * interesting and doesn't need locking. * For insert we allocate the bucket if this is the last bucket_table, * and then take the lock. * Sometimes we unlock a bucket by writing a new pointer there. In that * case we don't need to unlock, but we do need to reset state such as * local_bh. For that we have rht_assign_unlock(). As rcu_assign_pointer() * provides the same release semantics that bit_spin_unlock() provides, * this is safe. * When we write to a bucket without unlocking, we use rht_assign_locked(). */ static inline unsigned long rht_lock(struct bucket_table *tbl, struct rhash_lock_head __rcu **bkt) { unsigned long flags; local_irq_save(flags); bit_spin_lock(0, (unsigned long *)bkt); lock_map_acquire(&tbl->dep_map); return flags; } static inline unsigned long rht_lock_nested(struct bucket_table *tbl, struct rhash_lock_head __rcu **bucket, unsigned int subclass) { unsigned long flags; local_irq_save(flags); bit_spin_lock(0, (unsigned long *)bucket); lock_acquire_exclusive(&tbl->dep_map, subclass, 0, NULL, _THIS_IP_); return flags; } static inline void rht_unlock(struct bucket_table *tbl, struct rhash_lock_head __rcu **bkt, unsigned long flags) { lock_map_release(&tbl->dep_map); bit_spin_unlock(0, (unsigned long *)bkt); local_irq_restore(flags); } static inline struct rhash_head *__rht_ptr( struct rhash_lock_head *p, struct rhash_lock_head __rcu *const *bkt) { return (struct rhash_head *) ((unsigned long)p & ~BIT(0) ?: (unsigned long)RHT_NULLS_MARKER(bkt)); } /* * Where 'bkt' is a bucket and might be locked: * rht_ptr_rcu() dereferences that pointer and clears the lock bit. * rht_ptr() dereferences in a context where the bucket is locked. * rht_ptr_exclusive() dereferences in a context where exclusive * access is guaranteed, such as when destroying the table. */ static inline struct rhash_head *rht_ptr_rcu( struct rhash_lock_head __rcu *const *bkt) { return __rht_ptr(rcu_dereference(*bkt), bkt); } static inline struct rhash_head *rht_ptr( struct rhash_lock_head __rcu *const *bkt, struct bucket_table *tbl, unsigned int hash) { return __rht_ptr(rht_dereference_bucket(*bkt, tbl, hash), bkt); } static inline struct rhash_head *rht_ptr_exclusive( struct rhash_lock_head __rcu *const *bkt) { return __rht_ptr(rcu_dereference_protected(*bkt, 1), bkt); } static inline void rht_assign_locked(struct rhash_lock_head __rcu **bkt, struct rhash_head *obj) { if (rht_is_a_nulls(obj)) obj = NULL; rcu_assign_pointer(*bkt, (void *)((unsigned long)obj | BIT(0))); } static inline void rht_assign_unlock(struct bucket_table *tbl, struct rhash_lock_head __rcu **bkt, struct rhash_head *obj, unsigned long flags) { if (rht_is_a_nulls(obj)) obj = NULL; lock_map_release(&tbl->dep_map); rcu_assign_pointer(*bkt, (void *)obj); preempt_enable(); __release(bitlock); local_irq_restore(flags); } /** * rht_for_each_from - iterate over hash chain from given head * @pos: the &struct rhash_head to use as a loop cursor. * @head: the &struct rhash_head to start from * @tbl: the &struct bucket_table * @hash: the hash value / bucket index */ #define rht_for_each_from(pos, head, tbl, hash) \ for (pos = head; \ !rht_is_a_nulls(pos); \ pos = rht_dereference_bucket((pos)->next, tbl, hash)) /** * rht_for_each - iterate over hash chain * @pos: the &struct rhash_head to use as a loop cursor. * @tbl: the &struct bucket_table * @hash: the hash value / bucket index */ #define rht_for_each(pos, tbl, hash) \ rht_for_each_from(pos, rht_ptr(rht_bucket(tbl, hash), tbl, hash), \ tbl, hash) /** * rht_for_each_entry_from - iterate over hash chain from given head * @tpos: the type * to use as a loop cursor. * @pos: the &struct rhash_head to use as a loop cursor. * @head: the &struct rhash_head to start from * @tbl: the &struct bucket_table * @hash: the hash value / bucket index * @member: name of the &struct rhash_head within the hashable struct. */ #define rht_for_each_entry_from(tpos, pos, head, tbl, hash, member) \ for (pos = head; \ (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \ pos = rht_dereference_bucket((pos)->next, tbl, hash)) /** * rht_for_each_entry - iterate over hash chain of given type * @tpos: the type * to use as a loop cursor. * @pos: the &struct rhash_head to use as a loop cursor. * @tbl: the &struct bucket_table * @hash: the hash value / bucket index * @member: name of the &struct rhash_head within the hashable struct. */ #define rht_for_each_entry(tpos, pos, tbl, hash, member) \ rht_for_each_entry_from(tpos, pos, \ rht_ptr(rht_bucket(tbl, hash), tbl, hash), \ tbl, hash, member) /** * rht_for_each_entry_safe - safely iterate over hash chain of given type * @tpos: the type * to use as a loop cursor. * @pos: the &struct rhash_head to use as a loop cursor. * @next: the &struct rhash_head to use as next in loop cursor. * @tbl: the &struct bucket_table * @hash: the hash value / bucket index * @member: name of the &struct rhash_head within the hashable struct. * * This hash chain list-traversal primitive allows for the looped code to * remove the loop cursor from the list. */ #define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member) \ for (pos = rht_ptr(rht_bucket(tbl, hash), tbl, hash), \ next = !rht_is_a_nulls(pos) ? \ rht_dereference_bucket(pos->next, tbl, hash) : NULL; \ (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \ pos = next, \ next = !rht_is_a_nulls(pos) ? \ rht_dereference_bucket(pos->next, tbl, hash) : NULL) /** * rht_for_each_rcu_from - iterate over rcu hash chain from given head * @pos: the &struct rhash_head to use as a loop cursor. * @head: the &struct rhash_head to start from * @tbl: the &struct bucket_table * @hash: the hash value / bucket index * * This hash chain list-traversal primitive may safely run concurrently with * the _rcu mutation primitives such as rhashtable_insert() as long as the * traversal is guarded by rcu_read_lock(). */ #define rht_for_each_rcu_from(pos, head, tbl, hash) \ for (({barrier(); }), \ pos = head; \ !rht_is_a_nulls(pos); \ pos = rcu_dereference_raw(pos->next)) /** * rht_for_each_rcu - iterate over rcu hash chain * @pos: the &struct rhash_head to use as a loop cursor. * @tbl: the &struct bucket_table * @hash: the hash value / bucket index * * This hash chain list-traversal primitive may safely run concurrently with * the _rcu mutation primitives such as rhashtable_insert() as long as the * traversal is guarded by rcu_read_lock(). */ #define rht_for_each_rcu(pos, tbl, hash) \ for (({barrier(); }), \ pos = rht_ptr_rcu(rht_bucket(tbl, hash)); \ !rht_is_a_nulls(pos); \ pos = rcu_dereference_raw(pos->next)) /** * rht_for_each_entry_rcu_from - iterated over rcu hash chain from given head * @tpos: the type * to use as a loop cursor. * @pos: the &struct rhash_head to use as a loop cursor. * @head: the &struct rhash_head to start from * @tbl: the &struct bucket_table * @hash: the hash value / bucket index * @member: name of the &struct rhash_head within the hashable struct. * * This hash chain list-traversal primitive may safely run concurrently with * the _rcu mutation primitives such as rhashtable_insert() as long as the * traversal is guarded by rcu_read_lock(). */ #define rht_for_each_entry_rcu_from(tpos, pos, head, tbl, hash, member) \ for (({barrier(); }), \ pos = head; \ (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member); \ pos = rht_dereference_bucket_rcu(pos->next, tbl, hash)) /** * rht_for_each_entry_rcu - iterate over rcu hash chain of given type * @tpos: the type * to use as a loop cursor. * @pos: the &struct rhash_head to use as a loop cursor. * @tbl: the &struct bucket_table * @hash: the hash value / bucket index * @member: name of the &struct rhash_head within the hashable struct. * * This hash chain list-traversal primitive may safely run concurrently with * the _rcu mutation primitives such as rhashtable_insert() as long as the * traversal is guarded by rcu_read_lock(). */ #define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member) \ rht_for_each_entry_rcu_from(tpos, pos, \ rht_ptr_rcu(rht_bucket(tbl, hash)), \ tbl, hash, member) /** * rhl_for_each_rcu - iterate over rcu hash table list * @pos: the &struct rlist_head to use as a loop cursor. * @list: the head of the list * * This hash chain list-traversal primitive should be used on the * list returned by rhltable_lookup. */ #define rhl_for_each_rcu(pos, list) \ for (pos = list; pos; pos = rcu_dereference_raw(pos->next)) /** * rhl_for_each_entry_rcu - iterate over rcu hash table list of given type * @tpos: the type * to use as a loop cursor. * @pos: the &struct rlist_head to use as a loop cursor. * @list: the head of the list * @member: name of the &struct rlist_head within the hashable struct. * * This hash chain list-traversal primitive should be used on the * list returned by rhltable_lookup. */ #define rhl_for_each_entry_rcu(tpos, pos, list, member) \ for (pos = list; pos && rht_entry(tpos, pos, member); \ pos = rcu_dereference_raw(pos->next)) static inline int rhashtable_compare(struct rhashtable_compare_arg *arg, const void *obj) { struct rhashtable *ht = arg->ht; const char *ptr = obj; return memcmp(ptr + ht->p.key_offset, arg->key, ht->p.key_len); } /* Internal function, do not use. */ static inline struct rhash_head *__rhashtable_lookup( struct rhashtable *ht, const void *key, const struct rhashtable_params params) { struct rhashtable_compare_arg arg = { .ht = ht, .key = key, }; struct rhash_lock_head __rcu *const *bkt; struct bucket_table *tbl; struct rhash_head *he; unsigned int hash; tbl = rht_dereference_rcu(ht->tbl, ht); restart: hash = rht_key_hashfn(ht, tbl, key, params); bkt = rht_bucket(tbl, hash); do { rht_for_each_rcu_from(he, rht_ptr_rcu(bkt), tbl, hash) { if (params.obj_cmpfn ? params.obj_cmpfn(&arg, rht_obj(ht, he)) : rhashtable_compare(&arg, rht_obj(ht, he))) continue; return he; } /* An object might have been moved to a different hash chain, * while we walk along it - better check and retry. */ } while (he != RHT_NULLS_MARKER(bkt)); /* Ensure we see any new tables. */ smp_rmb(); tbl = rht_dereference_rcu(tbl->future_tbl, ht); if (unlikely(tbl)) goto restart; return NULL; } /** * rhashtable_lookup - search hash table * @ht: hash table * @key: the pointer to the key * @params: hash table parameters * * Computes the hash value for the key and traverses the bucket chain looking * for an entry with an identical key. The first matching entry is returned. * * This must only be called under the RCU read lock. * * Returns the first entry on which the compare function returned true. */ static inline void *rhashtable_lookup( struct rhashtable *ht, const void *key, const struct rhashtable_params params) { struct rhash_head *he = __rhashtable_lookup(ht, key, params); return he ? rht_obj(ht, he) : NULL; } /** * rhashtable_lookup_fast - search hash table, without RCU read lock * @ht: hash table * @key: the pointer to the key * @params: hash table parameters * * Computes the hash value for the key and traverses the bucket chain looking * for an entry with an identical key. The first matching entry is returned. * * Only use this function when you have other mechanisms guaranteeing * that the object won't go away after the RCU read lock is released. * * Returns the first entry on which the compare function returned true. */ static inline void *rhashtable_lookup_fast( struct rhashtable *ht, const void *key, const struct rhashtable_params params) { void *obj; rcu_read_lock(); obj = rhashtable_lookup(ht, key, params); rcu_read_unlock(); return obj; } /** * rhltable_lookup - search hash list table * @hlt: hash table * @key: the pointer to the key * @params: hash table parameters * * Computes the hash value for the key and traverses the bucket chain looking * for an entry with an identical key. All matching entries are returned * in a list. * * This must only be called under the RCU read lock. * * Returns the list of entries that match the given key. */ static inline struct rhlist_head *rhltable_lookup( struct rhltable *hlt, const void *key, const struct rhashtable_params params) { struct rhash_head *he = __rhashtable_lookup(&hlt->ht, key, params); return he ? container_of(he, struct rhlist_head, rhead) : NULL; } /* Internal function, please use rhashtable_insert_fast() instead. This * function returns the existing element already in hashes if there is a clash, * otherwise it returns an error via ERR_PTR(). */ static inline void *__rhashtable_insert_fast( struct rhashtable *ht, const void *key, struct rhash_head *obj, const struct rhashtable_params params, bool rhlist) { struct rhashtable_compare_arg arg = { .ht = ht, .key = key, }; struct rhash_lock_head __rcu **bkt; struct rhash_head __rcu **pprev; struct bucket_table *tbl; struct rhash_head *head; unsigned long flags; unsigned int hash; int elasticity; void *data; rcu_read_lock(); tbl = rht_dereference_rcu(ht->tbl, ht); hash = rht_head_hashfn(ht, tbl, obj, params); elasticity = RHT_ELASTICITY; bkt = rht_bucket_insert(ht, tbl, hash); data = ERR_PTR(-ENOMEM); if (!bkt) goto out; pprev = NULL; flags = rht_lock(tbl, bkt); if (unlikely(rcu_access_pointer(tbl->future_tbl))) { slow_path: rht_unlock(tbl, bkt, flags); rcu_read_unlock(); return rhashtable_insert_slow(ht, key, obj); } rht_for_each_from(head, rht_ptr(bkt, tbl, hash), tbl, hash) { struct rhlist_head *plist; struct rhlist_head *list; elasticity--; if (!key || (params.obj_cmpfn ? params.obj_cmpfn(&arg, rht_obj(ht, head)) : rhashtable_compare(&arg, rht_obj(ht, head)))) { pprev = &head->next; continue; } data = rht_obj(ht, head); if (!rhlist) goto out_unlock; list = container_of(obj, struct rhlist_head, rhead); plist = container_of(head, struct rhlist_head, rhead); RCU_INIT_POINTER(list->next, plist); head = rht_dereference_bucket(head->next, tbl, hash); RCU_INIT_POINTER(list->rhead.next, head); if (pprev) { rcu_assign_pointer(*pprev, obj); rht_unlock(tbl, bkt, flags); } else rht_assign_unlock(tbl, bkt, obj, flags); data = NULL; goto out; } if (elasticity <= 0) goto slow_path; data = ERR_PTR(-E2BIG); if (unlikely(rht_grow_above_max(ht, tbl))) goto out_unlock; if (unlikely(rht_grow_above_100(ht, tbl))) goto slow_path; /* Inserting at head of list makes unlocking free. */ head = rht_ptr(bkt, tbl, hash); RCU_INIT_POINTER(obj->next, head); if (rhlist) { struct rhlist_head *list; list = container_of(obj, struct rhlist_head, rhead); RCU_INIT_POINTER(list->next, NULL); } atomic_inc(&ht->nelems); rht_assign_unlock(tbl, bkt, obj, flags); if (rht_grow_above_75(ht, tbl)) schedule_work(&ht->run_work); data = NULL; out: rcu_read_unlock(); return data; out_unlock: rht_unlock(tbl, bkt, flags); goto out; } /** * rhashtable_insert_fast - insert object into hash table * @ht: hash table * @obj: pointer to hash head inside object * @params: hash table parameters * * Will take the per bucket bitlock to protect against mutual mutations * on the same bucket. Multiple insertions may occur in parallel unless * they map to the same bucket. * * It is safe to call this function from atomic context. * * Will trigger an automatic deferred table resizing if residency in the * table grows beyond 70%. */ static inline int rhashtable_insert_fast( struct rhashtable *ht, struct rhash_head *obj, const struct rhashtable_params params) { void *ret; ret = __rhashtable_insert_fast(ht, NULL, obj, params, false); if (IS_ERR(ret)) return PTR_ERR(ret); return ret == NULL ? 0 : -EEXIST; } /** * rhltable_insert_key - insert object into hash list table * @hlt: hash list table * @key: the pointer to the key * @list: pointer to hash list head inside object * @params: hash table parameters * * Will take the per bucket bitlock to protect against mutual mutations * on the same bucket. Multiple insertions may occur in parallel unless * they map to the same bucket. * * It is safe to call this function from atomic context. * * Will trigger an automatic deferred table resizing if residency in the * table grows beyond 70%. */ static inline int rhltable_insert_key( struct rhltable *hlt, const void *key, struct rhlist_head *list, const struct rhashtable_params params) { return PTR_ERR(__rhashtable_insert_fast(&hlt->ht, key, &list->rhead, params, true)); } /** * rhltable_insert - insert object into hash list table * @hlt: hash list table * @list: pointer to hash list head inside object * @params: hash table parameters * * Will take the per bucket bitlock to protect against mutual mutations * on the same bucket. Multiple insertions may occur in parallel unless * they map to the same bucket. * * It is safe to call this function from atomic context. * * Will trigger an automatic deferred table resizing if residency in the * table grows beyond 70%. */ static inline int rhltable_insert( struct rhltable *hlt, struct rhlist_head *list, const struct rhashtable_params params) { const char *key = rht_obj(&hlt->ht, &list->rhead); key += params.key_offset; return rhltable_insert_key(hlt, key, list, params); } /** * rhashtable_lookup_insert_fast - lookup and insert object into hash table * @ht: hash table * @obj: pointer to hash head inside object * @params: hash table parameters * * This lookup function may only be used for fixed key hash table (key_len * parameter set). It will BUG() if used inappropriately. * * It is safe to call this function from atomic context. * * Will trigger an automatic deferred table resizing if residency in the * table grows beyond 70%. */ static inline int rhashtable_lookup_insert_fast( struct rhashtable *ht, struct rhash_head *obj, const struct rhashtable_params params) { const char *key = rht_obj(ht, obj); void *ret; BUG_ON(ht->p.obj_hashfn); ret = __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj, params, false); if (IS_ERR(ret)) return PTR_ERR(ret); return ret == NULL ? 0 : -EEXIST; } /** * rhashtable_lookup_get_insert_fast - lookup and insert object into hash table * @ht: hash table * @obj: pointer to hash head inside object * @params: hash table parameters * * Just like rhashtable_lookup_insert_fast(), but this function returns the * object if it exists, NULL if it did not and the insertion was successful, * and an ERR_PTR otherwise. */ static inline void *rhashtable_lookup_get_insert_fast( struct rhashtable *ht, struct rhash_head *obj, const struct rhashtable_params params) { const char *key = rht_obj(ht, obj); BUG_ON(ht->p.obj_hashfn); return __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj, params, false); } /** * rhashtable_lookup_insert_key - search and insert object to hash table * with explicit key * @ht: hash table * @key: key * @obj: pointer to hash head inside object * @params: hash table parameters * * Lookups may occur in parallel with hashtable mutations and resizing. * * Will trigger an automatic deferred table resizing if residency in the * table grows beyond 70%. * * Returns zero on success. */ static inline int rhashtable_lookup_insert_key( struct rhashtable *ht, const void *key, struct rhash_head *obj, const struct rhashtable_params params) { void *ret; BUG_ON(!ht->p.obj_hashfn || !key); ret = __rhashtable_insert_fast(ht, key, obj, params, false); if (IS_ERR(ret)) return PTR_ERR(ret); return ret == NULL ? 0 : -EEXIST; } /** * rhashtable_lookup_get_insert_key - lookup and insert object into hash table * @ht: hash table * @key: key * @obj: pointer to hash head inside object * @params: hash table parameters * * Just like rhashtable_lookup_insert_key(), but this function returns the * object if it exists, NULL if it does not and the insertion was successful, * and an ERR_PTR otherwise. */ static inline void *rhashtable_lookup_get_insert_key( struct rhashtable *ht, const void *key, struct rhash_head *obj, const struct rhashtable_params params) { BUG_ON(!ht->p.obj_hashfn || !key); return __rhashtable_insert_fast(ht, key, obj, params, false); } /* Internal function, please use rhashtable_remove_fast() instead */ static inline int __rhashtable_remove_fast_one( struct rhashtable *ht, struct bucket_table *tbl, struct rhash_head *obj, const struct rhashtable_params params, bool rhlist) { struct rhash_lock_head __rcu **bkt; struct rhash_head __rcu **pprev; struct rhash_head *he; unsigned long flags; unsigned int hash; int err = -ENOENT; hash = rht_head_hashfn(ht, tbl, obj, params); bkt = rht_bucket_var(tbl, hash); if (!bkt) return -ENOENT; pprev = NULL; flags = rht_lock(tbl, bkt); rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) { struct rhlist_head *list; list = container_of(he, struct rhlist_head, rhead); if (he != obj) { struct rhlist_head __rcu **lpprev; pprev = &he->next; if (!rhlist) continue; do { lpprev = &list->next; list = rht_dereference_bucket(list->next, tbl, hash); } while (list && obj != &list->rhead); if (!list) continue; list = rht_dereference_bucket(list->next, tbl, hash); RCU_INIT_POINTER(*lpprev, list); err = 0; break; } obj = rht_dereference_bucket(obj->next, tbl, hash); err = 1; if (rhlist) { list = rht_dereference_bucket(list->next, tbl, hash); if (list) { RCU_INIT_POINTER(list->rhead.next, obj); obj = &list->rhead; err = 0; } } if (pprev) { rcu_assign_pointer(*pprev, obj); rht_unlock(tbl, bkt, flags); } else { rht_assign_unlock(tbl, bkt, obj, flags); } goto unlocked; } rht_unlock(tbl, bkt, flags); unlocked: if (err > 0) { atomic_dec(&ht->nelems); if (unlikely(ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))) schedule_work(&ht->run_work); err = 0; } return err; } /* Internal function, please use rhashtable_remove_fast() instead */ static inline int __rhashtable_remove_fast( struct rhashtable *ht, struct rhash_head *obj, const struct rhashtable_params params, bool rhlist) { struct bucket_table *tbl; int err; rcu_read_lock(); tbl = rht_dereference_rcu(ht->tbl, ht); /* Because we have already taken (and released) the bucket * lock in old_tbl, if we find that future_tbl is not yet * visible then that guarantees the entry to still be in * the old tbl if it exists. */ while ((err = __rhashtable_remove_fast_one(ht, tbl, obj, params, rhlist)) && (tbl = rht_dereference_rcu(tbl->future_tbl, ht))) ; rcu_read_unlock(); return err; } /** * rhashtable_remove_fast - remove object from hash table * @ht: hash table * @obj: pointer to hash head inside object * @params: hash table parameters * * Since the hash chain is single linked, the removal operation needs to * walk the bucket chain upon removal. The removal operation is thus * considerable slow if the hash table is not correctly sized. * * Will automatically shrink the table if permitted when residency drops * below 30%. * * Returns zero on success, -ENOENT if the entry could not be found. */ static inline int rhashtable_remove_fast( struct rhashtable *ht, struct rhash_head *obj, const struct rhashtable_params params) { return __rhashtable_remove_fast(ht, obj, params, false); } /** * rhltable_remove - remove object from hash list table * @hlt: hash list table * @list: pointer to hash list head inside object * @params: hash table parameters * * Since the hash chain is single linked, the removal operation needs to * walk the bucket chain upon removal. The removal operation is thus * considerably slower if the hash table is not correctly sized. * * Will automatically shrink the table if permitted when residency drops * below 30% * * Returns zero on success, -ENOENT if the entry could not be found. */ static inline int rhltable_remove( struct rhltable *hlt, struct rhlist_head *list, const struct rhashtable_params params) { return __rhashtable_remove_fast(&hlt->ht, &list->rhead, params, true); } /* Internal function, please use rhashtable_replace_fast() instead */ static inline int __rhashtable_replace_fast( struct rhashtable *ht, struct bucket_table *tbl, struct rhash_head *obj_old, struct rhash_head *obj_new, const struct rhashtable_params params) { struct rhash_lock_head __rcu **bkt; struct rhash_head __rcu **pprev; struct rhash_head *he; unsigned long flags; unsigned int hash; int err = -ENOENT; /* Minimally, the old and new objects must have same hash * (which should mean identifiers are the same). */ hash = rht_head_hashfn(ht, tbl, obj_old, params); if (hash != rht_head_hashfn(ht, tbl, obj_new, params)) return -EINVAL; bkt = rht_bucket_var(tbl, hash); if (!bkt) return -ENOENT; pprev = NULL; flags = rht_lock(tbl, bkt); rht_for_each_from(he, rht_ptr(bkt, tbl, hash), tbl, hash) { if (he != obj_old) { pprev = &he->next; continue; } rcu_assign_pointer(obj_new->next, obj_old->next); if (pprev) { rcu_assign_pointer(*pprev, obj_new); rht_unlock(tbl, bkt, flags); } else { rht_assign_unlock(tbl, bkt, obj_new, flags); } err = 0; goto unlocked; } rht_unlock(tbl, bkt, flags); unlocked: return err; } /** * rhashtable_replace_fast - replace an object in hash table * @ht: hash table * @obj_old: pointer to hash head inside object being replaced * @obj_new: pointer to hash head inside object which is new * @params: hash table parameters * * Replacing an object doesn't affect the number of elements in the hash table * or bucket, so we don't need to worry about shrinking or expanding the * table here. * * Returns zero on success, -ENOENT if the entry could not be found, * -EINVAL if hash is not the same for the old and new objects. */ static inline int rhashtable_replace_fast( struct rhashtable *ht, struct rhash_head *obj_old, struct rhash_head *obj_new, const struct rhashtable_params params) { struct bucket_table *tbl; int err; rcu_read_lock(); tbl = rht_dereference_rcu(ht->tbl, ht); /* Because we have already taken (and released) the bucket * lock in old_tbl, if we find that future_tbl is not yet * visible then that guarantees the entry to still be in * the old tbl if it exists. */ while ((err = __rhashtable_replace_fast(ht, tbl, obj_old, obj_new, params)) && (tbl = rht_dereference_rcu(tbl->future_tbl, ht))) ; rcu_read_unlock(); return err; } /** * rhltable_walk_enter - Initialise an iterator * @hlt: Table to walk over * @iter: Hash table Iterator * * This function prepares a hash table walk. * * Note that if you restart a walk after rhashtable_walk_stop you * may see the same object twice. Also, you may miss objects if * there are removals in between rhashtable_walk_stop and the next * call to rhashtable_walk_start. * * For a completely stable walk you should construct your own data * structure outside the hash table. * * This function may be called from any process context, including * non-preemptable context, but cannot be called from softirq or * hardirq context. * * You must call rhashtable_walk_exit after this function returns. */ static inline void rhltable_walk_enter(struct rhltable *hlt, struct rhashtable_iter *iter) { rhashtable_walk_enter(&hlt->ht, iter); } /** * rhltable_free_and_destroy - free elements and destroy hash list table * @hlt: the hash list table to destroy * @free_fn: callback to release resources of element * @arg: pointer passed to free_fn * * See documentation for rhashtable_free_and_destroy. */ static inline void rhltable_free_and_destroy(struct rhltable *hlt, void (*free_fn)(void *ptr, void *arg), void *arg) { rhashtable_free_and_destroy(&hlt->ht, free_fn, arg); } static inline void rhltable_destroy(struct rhltable *hlt) { rhltable_free_and_destroy(hlt, NULL, NULL); } #endif /* _LINUX_RHASHTABLE_H */ |
| 85 10 23 130 58 299 300 47 48 640 68 78 78 113 130 6 2 4 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * INET An implementation of the TCP/IP protocol suite for the LINUX * operating system. INET is implemented using the BSD Socket * interface as the means of communication with the user level. * * Definitions for the UDP module. * * Version: @(#)udp.h 1.0.2 05/07/93 * * Authors: Ross Biro * Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG> * * Fixes: * Alan Cox : Turned on udp checksums. I don't want to * chase 'memory corruption' bugs that aren't! */ #ifndef _UDP_H #define _UDP_H #include <linux/list.h> #include <linux/bug.h> #include <net/inet_sock.h> #include <net/gso.h> #include <net/sock.h> #include <net/snmp.h> #include <net/ip.h> #include <linux/ipv6.h> #include <linux/seq_file.h> #include <linux/poll.h> #include <linux/indirect_call_wrapper.h> /** * struct udp_skb_cb - UDP(-Lite) private variables * * @header: private variables used by IPv4/IPv6 * @cscov: checksum coverage length (UDP-Lite only) * @partial_cov: if set indicates partial csum coverage */ struct udp_skb_cb { union { struct inet_skb_parm h4; #if IS_ENABLED(CONFIG_IPV6) struct inet6_skb_parm h6; #endif } header; __u16 cscov; __u8 partial_cov; }; #define UDP_SKB_CB(__skb) ((struct udp_skb_cb *)((__skb)->cb)) /** * struct udp_hslot - UDP hash slot used by udp_table.hash/hash4 * * @head: head of list of sockets * @nulls_head: head of list of sockets, only used by hash4 * @count: number of sockets in 'head' list * @lock: spinlock protecting changes to head/count */ struct udp_hslot { union { struct hlist_head head; /* hash4 uses hlist_nulls to avoid moving wrongly onto another * hlist, because rehash() can happen with lookup(). */ struct hlist_nulls_head nulls_head; }; int count; spinlock_t lock; } __aligned(2 * sizeof(long)); /** * struct udp_hslot_main - UDP hash slot used by udp_table.hash2 * * @hslot: basic hash slot * @hash4_cnt: number of sockets in hslot4 of the same * (local port, local address) */ struct udp_hslot_main { struct udp_hslot hslot; /* must be the first member */ #if !IS_ENABLED(CONFIG_BASE_SMALL) u32 hash4_cnt; #endif } __aligned(2 * sizeof(long)); #define UDP_HSLOT_MAIN(__hslot) ((struct udp_hslot_main *)(__hslot)) /** * struct udp_table - UDP table * * @hash: hash table, sockets are hashed on (local port) * @hash2: hash table, sockets are hashed on (local port, local address) * @hash4: hash table, connected sockets are hashed on * (local port, local address, remote port, remote address) * @mask: number of slots in hash tables, minus 1 * @log: log2(number of slots in hash table) */ struct udp_table { struct udp_hslot *hash; struct udp_hslot_main *hash2; #if !IS_ENABLED(CONFIG_BASE_SMALL) struct udp_hslot *hash4; #endif unsigned int mask; unsigned int log; }; extern struct udp_table udp_table; void udp_table_init(struct udp_table *, const char *); static inline struct udp_hslot *udp_hashslot(struct udp_table *table, const struct net *net, unsigned int num) { return &table->hash[udp_hashfn(net, num, table->mask)]; } /* * For secondary hash, net_hash_mix() is performed before calling * udp_hashslot2(), this explains difference with udp_hashslot() */ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table, unsigned int hash) { return &table->hash2[hash & table->mask].hslot; } #if IS_ENABLED(CONFIG_BASE_SMALL) static inline void udp_table_hash4_init(struct udp_table *table) { } static inline struct udp_hslot *udp_hashslot4(struct udp_table *table, unsigned int hash) { BUILD_BUG(); return NULL; } static inline bool udp_hashed4(const struct sock *sk) { return false; } static inline unsigned int udp_hash4_slot_size(void) { return 0; } static inline bool udp_has_hash4(const struct udp_hslot *hslot2) { return false; } static inline void udp_hash4_inc(struct udp_hslot *hslot2) { } static inline void udp_hash4_dec(struct udp_hslot *hslot2) { } #else /* !CONFIG_BASE_SMALL */ /* Must be called with table->hash2 initialized */ static inline void udp_table_hash4_init(struct udp_table *table) { table->hash4 = (void *)(table->hash2 + (table->mask + 1)); for (int i = 0; i <= table->mask; i++) { table->hash2[i].hash4_cnt = 0; INIT_HLIST_NULLS_HEAD(&table->hash4[i].nulls_head, i); table->hash4[i].count = 0; spin_lock_init(&table->hash4[i].lock); } } static inline struct udp_hslot *udp_hashslot4(struct udp_table *table, unsigned int hash) { return &table->hash4[hash & table->mask]; } static inline bool udp_hashed4(const struct sock *sk) { return !hlist_nulls_unhashed(&udp_sk(sk)->udp_lrpa_node); } static inline unsigned int udp_hash4_slot_size(void) { return sizeof(struct udp_hslot); } static inline bool udp_has_hash4(const struct udp_hslot *hslot2) { return UDP_HSLOT_MAIN(hslot2)->hash4_cnt; } static inline void udp_hash4_inc(struct udp_hslot *hslot2) { UDP_HSLOT_MAIN(hslot2)->hash4_cnt++; } static inline void udp_hash4_dec(struct udp_hslot *hslot2) { UDP_HSLOT_MAIN(hslot2)->hash4_cnt--; } #endif /* CONFIG_BASE_SMALL */ extern struct proto udp_prot; DECLARE_PER_CPU(int, udp_memory_per_cpu_fw_alloc); /* sysctl variables for udp */ extern long sysctl_udp_mem[3]; extern int sysctl_udp_rmem_min; extern int sysctl_udp_wmem_min; struct sk_buff; /* * Generic checksumming routines for UDP(-Lite) v4 and v6 */ static inline __sum16 __udp_lib_checksum_complete(struct sk_buff *skb) { return (UDP_SKB_CB(skb)->cscov == skb->len ? __skb_checksum_complete(skb) : __skb_checksum_complete_head(skb, UDP_SKB_CB(skb)->cscov)); } static inline int udp_lib_checksum_complete(struct sk_buff *skb) { return !skb_csum_unnecessary(skb) && __udp_lib_checksum_complete(skb); } /** * udp_csum_outgoing - compute UDPv4/v6 checksum over fragments * @sk: socket we are writing to * @skb: sk_buff containing the filled-in UDP header * (checksum field must be zeroed out) */ static inline __wsum udp_csum_outgoing(struct sock *sk, struct sk_buff *skb) { __wsum csum = csum_partial(skb_transport_header(skb), sizeof(struct udphdr), 0); skb_queue_walk(&sk->sk_write_queue, skb) { csum = csum_add(csum, skb->csum); } return csum; } static inline __wsum udp_csum(struct sk_buff *skb) { __wsum csum = csum_partial(skb_transport_header(skb), sizeof(struct udphdr), skb->csum); for (skb = skb_shinfo(skb)->frag_list; skb; skb = skb->next) { csum = csum_add(csum, skb->csum); } return csum; } static inline __sum16 udp_v4_check(int len, __be32 saddr, __be32 daddr, __wsum base) { return csum_tcpudp_magic(saddr, daddr, len, IPPROTO_UDP, base); } void udp_set_csum(bool nocheck, struct sk_buff *skb, __be32 saddr, __be32 daddr, int len); static inline void udp_csum_pull_header(struct sk_buff *skb) { if (!skb->csum_valid && skb->ip_summed == CHECKSUM_NONE) skb->csum = csum_partial(skb->data, sizeof(struct udphdr), skb->csum); skb_pull_rcsum(skb, sizeof(struct udphdr)); UDP_SKB_CB(skb)->cscov -= sizeof(struct udphdr); } typedef struct sock *(*udp_lookup_t)(const struct sk_buff *skb, __be16 sport, __be16 dport); void udp_v6_early_demux(struct sk_buff *skb); INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *)); struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb, netdev_features_t features, bool is_ipv6); static inline void udp_lib_init_sock(struct sock *sk) { struct udp_sock *up = udp_sk(sk); skb_queue_head_init(&up->reader_queue); INIT_HLIST_NODE(&up->tunnel_list); up->forward_threshold = sk->sk_rcvbuf >> 2; set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags); } /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */ static inline int udp_lib_hash(struct sock *sk) { BUG(); return 0; } void udp_lib_unhash(struct sock *sk); void udp_lib_rehash(struct sock *sk, u16 new_hash, u16 new_hash4); u32 udp_ehashfn(const struct net *net, const __be32 laddr, const __u16 lport, const __be32 faddr, const __be16 fport); static inline void udp_lib_close(struct sock *sk, long timeout) { sk_common_release(sk); } /* hash4 routines shared between UDPv4/6 */ #if IS_ENABLED(CONFIG_BASE_SMALL) static inline void udp_lib_hash4(struct sock *sk, u16 hash) { } static inline void udp4_hash4(struct sock *sk) { } #else /* !CONFIG_BASE_SMALL */ void udp_lib_hash4(struct sock *sk, u16 hash); void udp4_hash4(struct sock *sk); #endif /* CONFIG_BASE_SMALL */ int udp_lib_get_port(struct sock *sk, unsigned short snum, unsigned int hash2_nulladdr); u32 udp_flow_hashrnd(void); static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb, int min, int max, bool use_eth) { u32 hash; if (min >= max) { /* Use default range */ inet_get_local_port_range(net, &min, &max); } hash = skb_get_hash(skb); if (unlikely(!hash)) { if (use_eth) { /* Can't find a normal hash, caller has indicated an * Ethernet packet so use that to compute a hash. */ hash = jhash(skb->data, 2 * ETH_ALEN, (__force u32) skb->protocol); } else { /* Can't derive any sort of hash for the packet, set * to some consistent random value. */ hash = udp_flow_hashrnd(); } } /* Since this is being sent on the wire obfuscate hash a bit * to minimize possibility that any useful information to an * attacker is leaked. Only upper 16 bits are relevant in the * computation for 16 bit port value. */ hash ^= hash << 16; return htons((((u64) hash * (max - min)) >> 32) + min); } static inline int udp_rqueue_get(struct sock *sk) { return sk_rmem_alloc_get(sk) - READ_ONCE(udp_sk(sk)->forward_deficit); } static inline bool udp_sk_bound_dev_eq(const struct net *net, int bound_dev_if, int dif, int sdif) { #if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV) return inet_bound_dev_eq(!!READ_ONCE(net->ipv4.sysctl_udp_l3mdev_accept), bound_dev_if, dif, sdif); #else return inet_bound_dev_eq(true, bound_dev_if, dif, sdif); #endif } /* net/ipv4/udp.c */ void udp_destruct_common(struct sock *sk); void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len); int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb); void udp_skb_destructor(struct sock *sk, struct sk_buff *skb); struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, int *off, int *err); static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int flags, int *err) { int off = 0; return __skb_recv_udp(sk, flags, &off, err); } int udp_v4_early_demux(struct sk_buff *skb); bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst); int udp_err(struct sk_buff *, u32); int udp_abort(struct sock *sk, int err); int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len); void udp_splice_eof(struct socket *sock); int udp_push_pending_frames(struct sock *sk); void udp_flush_pending_frames(struct sock *sk); int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size); void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst); int udp_rcv(struct sk_buff *skb); int udp_ioctl(struct sock *sk, int cmd, int *karg); int udp_init_sock(struct sock *sk); int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len); int __udp_disconnect(struct sock *sk, int flags); int udp_disconnect(struct sock *sk, int flags); __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait); struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb, netdev_features_t features, bool is_ipv6); int udp_lib_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen); int udp_lib_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval, unsigned int optlen, int (*push_pending_frames)(struct sock *)); struct sock *udp4_lib_lookup(const struct net *net, __be32 saddr, __be16 sport, __be32 daddr, __be16 dport, int dif); struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr, __be16 sport, __be32 daddr, __be16 dport, int dif, int sdif, struct udp_table *tbl, struct sk_buff *skb); struct sock *udp4_lib_lookup_skb(const struct sk_buff *skb, __be16 sport, __be16 dport); struct sock *udp6_lib_lookup(const struct net *net, const struct in6_addr *saddr, __be16 sport, const struct in6_addr *daddr, __be16 dport, int dif); struct sock *__udp6_lib_lookup(const struct net *net, const struct in6_addr *saddr, __be16 sport, const struct in6_addr *daddr, __be16 dport, int dif, int sdif, struct udp_table *tbl, struct sk_buff *skb); struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, __be16 sport, __be16 dport); int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); /* UDP uses skb->dev_scratch to cache as much information as possible and avoid * possibly multiple cache miss on dequeue() */ struct udp_dev_scratch { /* skb->truesize and the stateless bit are embedded in a single field; * do not use a bitfield since the compiler emits better/smaller code * this way */ u32 _tsize_state; #if BITS_PER_LONG == 64 /* len and the bit needed to compute skb_csum_unnecessary * will be on cold cache lines at recvmsg time. * skb->len can be stored on 16 bits since the udp header has been * already validated and pulled. */ u16 len; bool is_linear; bool csum_unnecessary; #endif }; static inline struct udp_dev_scratch *udp_skb_scratch(struct sk_buff *skb) { return (struct udp_dev_scratch *)&skb->dev_scratch; } #if BITS_PER_LONG == 64 static inline unsigned int udp_skb_len(struct sk_buff *skb) { return udp_skb_scratch(skb)->len; } static inline bool udp_skb_csum_unnecessary(struct sk_buff *skb) { return udp_skb_scratch(skb)->csum_unnecessary; } static inline bool udp_skb_is_linear(struct sk_buff *skb) { return udp_skb_scratch(skb)->is_linear; } #else static inline unsigned int udp_skb_len(struct sk_buff *skb) { return skb->len; } static inline bool udp_skb_csum_unnecessary(struct sk_buff *skb) { return skb_csum_unnecessary(skb); } static inline bool udp_skb_is_linear(struct sk_buff *skb) { return !skb_is_nonlinear(skb); } #endif static inline int copy_linear_skb(struct sk_buff *skb, int len, int off, struct iov_iter *to) { return copy_to_iter_full(skb->data + off, len, to) ? 0 : -EFAULT; } /* * SNMP statistics for UDP and UDP-Lite */ #define UDP_INC_STATS(net, field, is_udplite) do { \ if (is_udplite) SNMP_INC_STATS((net)->mib.udplite_statistics, field); \ else SNMP_INC_STATS((net)->mib.udp_statistics, field); } while(0) #define __UDP_INC_STATS(net, field, is_udplite) do { \ if (is_udplite) __SNMP_INC_STATS((net)->mib.udplite_statistics, field); \ else __SNMP_INC_STATS((net)->mib.udp_statistics, field); } while(0) #define __UDP6_INC_STATS(net, field, is_udplite) do { \ if (is_udplite) __SNMP_INC_STATS((net)->mib.udplite_stats_in6, field);\ else __SNMP_INC_STATS((net)->mib.udp_stats_in6, field); \ } while(0) #define UDP6_INC_STATS(net, field, __lite) do { \ if (__lite) SNMP_INC_STATS((net)->mib.udplite_stats_in6, field); \ else SNMP_INC_STATS((net)->mib.udp_stats_in6, field); \ } while(0) #if IS_ENABLED(CONFIG_IPV6) #define __UDPX_MIB(sk, ipv4) \ ({ \ ipv4 ? (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \ sock_net(sk)->mib.udp_statistics) : \ (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_stats_in6 : \ sock_net(sk)->mib.udp_stats_in6); \ }) #else #define __UDPX_MIB(sk, ipv4) \ ({ \ IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \ sock_net(sk)->mib.udp_statistics; \ }) #endif #define __UDPX_INC_STATS(sk, field) \ __SNMP_INC_STATS(__UDPX_MIB(sk, (sk)->sk_family == AF_INET), field) #ifdef CONFIG_PROC_FS struct udp_seq_afinfo { sa_family_t family; struct udp_table *udp_table; }; struct udp_iter_state { struct seq_net_private p; int bucket; }; void *udp_seq_start(struct seq_file *seq, loff_t *pos); void *udp_seq_next(struct seq_file *seq, void *v, loff_t *pos); void udp_seq_stop(struct seq_file *seq, void *v); extern const struct seq_operations udp_seq_ops; extern const struct seq_operations udp6_seq_ops; int udp4_proc_init(void); void udp4_proc_exit(void); #endif /* CONFIG_PROC_FS */ int udpv4_offload_init(void); void udp_init(void); DECLARE_STATIC_KEY_FALSE(udp_encap_needed_key); void udp_encap_enable(void); void udp_encap_disable(void); #if IS_ENABLED(CONFIG_IPV6) DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key); void udpv6_encap_enable(void); #endif static inline struct sk_buff *udp_rcv_segment(struct sock *sk, struct sk_buff *skb, bool ipv4) { netdev_features_t features = NETIF_F_SG; struct sk_buff *segs; int drop_count; /* * Segmentation in UDP receive path is only for UDP GRO, drop udp * fragmentation offload (UFO) packets. */ if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) { drop_count = 1; goto drop; } /* Avoid csum recalculation by skb_segment unless userspace explicitly * asks for the final checksum values */ if (!inet_get_convert_csum(sk)) features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM; /* UDP segmentation expects packets of type CHECKSUM_PARTIAL or * CHECKSUM_NONE in __udp_gso_segment. UDP GRO indeed builds partial * packets in udp_gro_complete_segment. As does UDP GSO, verified by * udp_send_skb. But when those packets are looped in dev_loopback_xmit * their ip_summed CHECKSUM_NONE is changed to CHECKSUM_UNNECESSARY. * Reset in this specific case, where PARTIAL is both correct and * required. */ if (skb->pkt_type == PACKET_LOOPBACK) skb->ip_summed = CHECKSUM_PARTIAL; /* the GSO CB lays after the UDP one, no need to save and restore any * CB fragment */ segs = __skb_gso_segment(skb, features, false); if (IS_ERR_OR_NULL(segs)) { drop_count = skb_shinfo(skb)->gso_segs; goto drop; } consume_skb(skb); return segs; drop: atomic_add(drop_count, &sk->sk_drops); SNMP_ADD_STATS(__UDPX_MIB(sk, ipv4), UDP_MIB_INERRORS, drop_count); kfree_skb(skb); return NULL; } static inline void udp_post_segment_fix_csum(struct sk_buff *skb) { /* UDP-lite can't land here - no GRO */ WARN_ON_ONCE(UDP_SKB_CB(skb)->partial_cov); /* UDP packets generated with UDP_SEGMENT and traversing: * * UDP tunnel(xmit) -> veth (segmentation) -> veth (gro) -> UDP tunnel (rx) * * can reach an UDP socket with CHECKSUM_NONE, because * __iptunnel_pull_header() converts CHECKSUM_PARTIAL into NONE. * SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST packets with no UDP tunnel will * have a valid checksum, as the GRO engine validates the UDP csum * before the aggregation and nobody strips such info in between. * Instead of adding another check in the tunnel fastpath, we can force * a valid csum after the segmentation. * Additionally fixup the UDP CB. */ UDP_SKB_CB(skb)->cscov = skb->len; if (skb->ip_summed == CHECKSUM_NONE && !skb->csum_valid) skb->csum_valid = 1; } #ifdef CONFIG_BPF_SYSCALL struct sk_psock; int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore); #endif #endif /* _UDP_H */ |
| 12344 34791 34789 34785 12344 36266 36337 36253 36258 2 1756 13237 13254 61 13236 13213 1585 61 34795 34866 3 3 4 1 1 2 58 58 40 9 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 | // SPDX-License-Identifier: GPL-2.0 /* * Lockless hierarchical page accounting & limiting * * Copyright (C) 2014 Red Hat, Inc., Johannes Weiner */ #include <linux/page_counter.h> #include <linux/atomic.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/sched.h> #include <linux/bug.h> #include <asm/page.h> static bool track_protection(struct page_counter *c) { return c->protection_support; } static void propagate_protected_usage(struct page_counter *c, unsigned long usage) { unsigned long protected, old_protected; long delta; if (!c->parent) return; protected = min(usage, READ_ONCE(c->min)); old_protected = atomic_long_read(&c->min_usage); if (protected != old_protected) { old_protected = atomic_long_xchg(&c->min_usage, protected); delta = protected - old_protected; if (delta) atomic_long_add(delta, &c->parent->children_min_usage); } protected = min(usage, READ_ONCE(c->low)); old_protected = atomic_long_read(&c->low_usage); if (protected != old_protected) { old_protected = atomic_long_xchg(&c->low_usage, protected); delta = protected - old_protected; if (delta) atomic_long_add(delta, &c->parent->children_low_usage); } } /** * page_counter_cancel - take pages out of the local counter * @counter: counter * @nr_pages: number of pages to cancel */ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages) { long new; new = atomic_long_sub_return(nr_pages, &counter->usage); /* More uncharges than charges? */ if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n", new, nr_pages)) { new = 0; atomic_long_set(&counter->usage, new); } if (track_protection(counter)) propagate_protected_usage(counter, new); } /** * page_counter_charge - hierarchically charge pages * @counter: counter * @nr_pages: number of pages to charge * * NOTE: This does not consider any configured counter limits. */ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; bool protection = track_protection(counter); for (c = counter; c; c = c->parent) { long new; new = atomic_long_add_return(nr_pages, &c->usage); if (protection) propagate_protected_usage(c, new); /* * This is indeed racy, but we can live with some * inaccuracy in the watermark. * * Notably, we have two watermarks to allow for both a globally * visible peak and one that can be reset at a smaller scope. * * Since we reset both watermarks when the global reset occurs, * we can guarantee that watermark >= local_watermark, so we * don't need to do both comparisons every time. * * On systems with branch predictors, the inner condition should * be almost free. */ if (new > READ_ONCE(c->local_watermark)) { WRITE_ONCE(c->local_watermark, new); if (new > READ_ONCE(c->watermark)) WRITE_ONCE(c->watermark, new); } } } /** * page_counter_try_charge - try to hierarchically charge pages * @counter: counter * @nr_pages: number of pages to charge * @fail: points first counter to hit its limit, if any * * Returns %true on success, or %false and @fail if the counter or one * of its ancestors has hit its configured limit. */ bool page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages, struct page_counter **fail) { struct page_counter *c; bool protection = track_protection(counter); bool track_failcnt = counter->track_failcnt; for (c = counter; c; c = c->parent) { long new; /* * Charge speculatively to avoid an expensive CAS. If * a bigger charge fails, it might falsely lock out a * racing smaller charge and send it into reclaim * early, but the error is limited to the difference * between the two sizes, which is less than 2M/4M in * case of a THP locking out a regular page charge. * * The atomic_long_add_return() implies a full memory * barrier between incrementing the count and reading * the limit. When racing with page_counter_set_max(), * we either see the new limit or the setter sees the * counter has changed and retries. */ new = atomic_long_add_return(nr_pages, &c->usage); if (new > c->max) { atomic_long_sub(nr_pages, &c->usage); /* * This is racy, but we can live with some * inaccuracy in the failcnt which is only used * to report stats. */ if (track_failcnt) data_race(c->failcnt++); *fail = c; goto failed; } if (protection) propagate_protected_usage(c, new); /* see comment on page_counter_charge */ if (new > READ_ONCE(c->local_watermark)) { WRITE_ONCE(c->local_watermark, new); if (new > READ_ONCE(c->watermark)) WRITE_ONCE(c->watermark, new); } } return true; failed: for (c = counter; c != *fail; c = c->parent) page_counter_cancel(c, nr_pages); return false; } /** * page_counter_uncharge - hierarchically uncharge pages * @counter: counter * @nr_pages: number of pages to uncharge */ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; for (c = counter; c; c = c->parent) page_counter_cancel(c, nr_pages); } /** * page_counter_set_max - set the maximum number of pages allowed * @counter: counter * @nr_pages: limit to set * * Returns 0 on success, -EBUSY if the current number of pages on the * counter already exceeds the specified limit. * * The caller must serialize invocations on the same counter. */ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages) { for (;;) { unsigned long old; long usage; /* * Update the limit while making sure that it's not * below the concurrently-changing counter value. * * The xchg implies two full memory barriers before * and after, so the read-swap-read is ordered and * ensures coherency with page_counter_try_charge(): * that function modifies the count before checking * the limit, so if it sees the old limit, we see the * modified counter and retry. */ usage = page_counter_read(counter); if (usage > nr_pages) return -EBUSY; old = xchg(&counter->max, nr_pages); if (page_counter_read(counter) <= usage || nr_pages >= old) return 0; counter->max = old; cond_resched(); } } /** * page_counter_set_min - set the amount of protected memory * @counter: counter * @nr_pages: value to set * * The caller must serialize invocations on the same counter. */ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; WRITE_ONCE(counter->min, nr_pages); for (c = counter; c; c = c->parent) propagate_protected_usage(c, atomic_long_read(&c->usage)); } /** * page_counter_set_low - set the amount of protected memory * @counter: counter * @nr_pages: value to set * * The caller must serialize invocations on the same counter. */ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; WRITE_ONCE(counter->low, nr_pages); for (c = counter; c; c = c->parent) propagate_protected_usage(c, atomic_long_read(&c->usage)); } /** * page_counter_memparse - memparse() for page counter limits * @buf: string to parse * @max: string meaning maximum possible value * @nr_pages: returns the result in number of pages * * Returns -EINVAL, or 0 and @nr_pages on success. @nr_pages will be * limited to %PAGE_COUNTER_MAX. */ int page_counter_memparse(const char *buf, const char *max, unsigned long *nr_pages) { char *end; u64 bytes; if (!strcmp(buf, max)) { *nr_pages = PAGE_COUNTER_MAX; return 0; } bytes = memparse(buf, &end); if (*end != '\0') return -EINVAL; *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX); return 0; } #if IS_ENABLED(CONFIG_MEMCG) || IS_ENABLED(CONFIG_CGROUP_DMEM) /* * This function calculates an individual page counter's effective * protection which is derived from its own memory.min/low, its * parent's and siblings' settings, as well as the actual memory * distribution in the tree. * * The following rules apply to the effective protection values: * * 1. At the first level of reclaim, effective protection is equal to * the declared protection in memory.min and memory.low. * * 2. To enable safe delegation of the protection configuration, at * subsequent levels the effective protection is capped to the * parent's effective protection. * * 3. To make complex and dynamic subtrees easier to configure, the * user is allowed to overcommit the declared protection at a given * level. If that is the case, the parent's effective protection is * distributed to the children in proportion to how much protection * they have declared and how much of it they are utilizing. * * This makes distribution proportional, but also work-conserving: * if one counter claims much more protection than it uses memory, * the unused remainder is available to its siblings. * * 4. Conversely, when the declared protection is undercommitted at a * given level, the distribution of the larger parental protection * budget is NOT proportional. A counter's protection from a sibling * is capped to its own memory.min/low setting. * * 5. However, to allow protecting recursive subtrees from each other * without having to declare each individual counter's fixed share * of the ancestor's claim to protection, any unutilized - * "floating" - protection from up the tree is distributed in * proportion to each counter's *usage*. This makes the protection * neutral wrt sibling cgroups and lets them compete freely over * the shared parental protection budget, but it protects the * subtree as a whole from neighboring subtrees. * * Note that 4. and 5. are not in conflict: 4. is about protecting * against immediate siblings whereas 5. is about protecting against * neighboring subtrees. */ static unsigned long effective_protection(unsigned long usage, unsigned long parent_usage, unsigned long setting, unsigned long parent_effective, unsigned long siblings_protected, bool recursive_protection) { unsigned long protected; unsigned long ep; protected = min(usage, setting); /* * If all cgroups at this level combined claim and use more * protection than what the parent affords them, distribute * shares in proportion to utilization. * * We are using actual utilization rather than the statically * claimed protection in order to be work-conserving: claimed * but unused protection is available to siblings that would * otherwise get a smaller chunk than what they claimed. */ if (siblings_protected > parent_effective) return protected * parent_effective / siblings_protected; /* * Ok, utilized protection of all children is within what the * parent affords them, so we know whatever this child claims * and utilizes is effectively protected. * * If there is unprotected usage beyond this value, reclaim * will apply pressure in proportion to that amount. * * If there is unutilized protection, the cgroup will be fully * shielded from reclaim, but we do return a smaller value for * protection than what the group could enjoy in theory. This * is okay. With the overcommit distribution above, effective * protection is always dependent on how memory is actually * consumed among the siblings anyway. */ ep = protected; /* * If the children aren't claiming (all of) the protection * afforded to them by the parent, distribute the remainder in * proportion to the (unprotected) memory of each cgroup. That * way, cgroups that aren't explicitly prioritized wrt each * other compete freely over the allowance, but they are * collectively protected from neighboring trees. * * We're using unprotected memory for the weight so that if * some cgroups DO claim explicit protection, we don't protect * the same bytes twice. * * Check both usage and parent_usage against the respective * protected values. One should imply the other, but they * aren't read atomically - make sure the division is sane. */ if (!recursive_protection) return ep; if (parent_effective > siblings_protected && parent_usage > siblings_protected && usage > protected) { unsigned long unclaimed; unclaimed = parent_effective - siblings_protected; unclaimed *= usage - protected; unclaimed /= parent_usage - siblings_protected; ep += unclaimed; } return ep; } /** * page_counter_calculate_protection - check if memory consumption is in the normal range * @root: the top ancestor of the sub-tree being checked * @counter: the page_counter the counter to update * @recursive_protection: Whether to use memory_recursiveprot behavior. * * Calculates elow/emin thresholds for given page_counter. * * WARNING: This function is not stateless! It can only be used as part * of a top-down tree iteration, not for isolated queries. */ void page_counter_calculate_protection(struct page_counter *root, struct page_counter *counter, bool recursive_protection) { unsigned long usage, parent_usage; struct page_counter *parent = counter->parent; /* * Effective values of the reclaim targets are ignored so they * can be stale. Have a look at mem_cgroup_protection for more * details. * TODO: calculation should be more robust so that we do not need * that special casing. */ if (root == counter) return; usage = page_counter_read(counter); if (!usage) return; if (parent == root) { counter->emin = READ_ONCE(counter->min); counter->elow = READ_ONCE(counter->low); return; } parent_usage = page_counter_read(parent); WRITE_ONCE(counter->emin, effective_protection(usage, parent_usage, READ_ONCE(counter->min), READ_ONCE(parent->emin), atomic_long_read(&parent->children_min_usage), recursive_protection)); WRITE_ONCE(counter->elow, effective_protection(usage, parent_usage, READ_ONCE(counter->low), READ_ONCE(parent->elow), atomic_long_read(&parent->children_low_usage), recursive_protection)); } #endif /* CONFIG_MEMCG || CONFIG_CGROUP_DMEM */ |
| 1156 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | /* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM bpf_test_run #if !defined(_TRACE_BPF_TEST_RUN_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_BPF_TEST_RUN_H #include <linux/tracepoint.h> TRACE_EVENT(bpf_trigger_tp, TP_PROTO(int nonce), TP_ARGS(nonce), TP_STRUCT__entry( __field(int, nonce) ), TP_fast_assign( __entry->nonce = nonce; ), TP_printk("nonce %d", __entry->nonce) ); DECLARE_EVENT_CLASS(bpf_test_finish, TP_PROTO(int *err), TP_ARGS(err), TP_STRUCT__entry( __field(int, err) ), TP_fast_assign( __entry->err = *err; ), TP_printk("bpf_test_finish with err=%d", __entry->err) ); #ifdef DEFINE_EVENT_WRITABLE #undef BPF_TEST_RUN_DEFINE_EVENT #define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size) \ DEFINE_EVENT_WRITABLE(template, call, PARAMS(proto), \ PARAMS(args), size) #else #undef BPF_TEST_RUN_DEFINE_EVENT #define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size) \ DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args)) #endif BPF_TEST_RUN_DEFINE_EVENT(bpf_test_finish, bpf_test_finish, TP_PROTO(int *err), TP_ARGS(err), sizeof(int) ); #endif /* This part must be outside protection */ #include <trace/define_trace.h> |
| 692 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * Cryptographic API for algorithms (i.e., low-level API). * * Copyright (c) 2006 Herbert Xu <herbert@gondor.apana.org.au> */ #ifndef _CRYPTO_ALGAPI_H #define _CRYPTO_ALGAPI_H #include <crypto/utils.h> #include <linux/align.h> #include <linux/cache.h> #include <linux/crypto.h> #include <linux/list.h> #include <linux/types.h> #include <linux/workqueue.h> /* * Maximum values for blocksize and alignmask, used to allocate * static buffers that are big enough for any combination of * algs and architectures. Ciphers have a lower maximum size. */ #define MAX_ALGAPI_BLOCKSIZE 160 #define MAX_ALGAPI_ALIGNMASK 127 #define MAX_CIPHER_BLOCKSIZE 16 #define MAX_CIPHER_ALIGNMASK 15 #ifdef ARCH_DMA_MINALIGN #define CRYPTO_DMA_ALIGN ARCH_DMA_MINALIGN #else #define CRYPTO_DMA_ALIGN CRYPTO_MINALIGN #endif #define CRYPTO_DMA_PADDING ((CRYPTO_DMA_ALIGN - 1) & ~(CRYPTO_MINALIGN - 1)) /* * Autoloaded crypto modules should only use a prefixed name to avoid allowing * arbitrary modules to be loaded. Loading from userspace may still need the * unprefixed names, so retains those aliases as well. * This uses __MODULE_INFO directly instead of MODULE_ALIAS because pre-4.3 * gcc (e.g. avr32 toolchain) uses __LINE__ for uniqueness, and this macro * expands twice on the same line. Instead, use a separate base name for the * alias. */ #define MODULE_ALIAS_CRYPTO(name) \ MODULE_INFO(alias, name); \ MODULE_INFO(alias, "crypto-" name) struct crypto_aead; struct crypto_instance; struct module; struct notifier_block; struct rtattr; struct scatterlist; struct seq_file; struct sk_buff; union crypto_no_such_thing; struct crypto_instance { struct crypto_alg alg; struct crypto_template *tmpl; union { /* Node in list of instances after registration. */ struct hlist_node list; /* List of attached spawns before registration. */ struct crypto_spawn *spawns; }; void *__ctx[] CRYPTO_MINALIGN_ATTR; }; struct crypto_template { struct list_head list; struct hlist_head instances; struct hlist_head dead; struct module *module; struct work_struct free_work; int (*create)(struct crypto_template *tmpl, struct rtattr **tb); char name[CRYPTO_MAX_ALG_NAME]; }; struct crypto_spawn { struct list_head list; struct crypto_alg *alg; union { /* Back pointer to instance after registration.*/ struct crypto_instance *inst; /* Spawn list pointer prior to registration. */ struct crypto_spawn *next; }; const struct crypto_type *frontend; u32 mask; bool dead; bool registered; }; struct crypto_queue { struct list_head list; struct list_head *backlog; unsigned int qlen; unsigned int max_qlen; }; struct crypto_attr_alg { char name[CRYPTO_MAX_ALG_NAME]; }; struct crypto_attr_type { u32 type; u32 mask; }; /* * Algorithm registration interface. */ int crypto_register_alg(struct crypto_alg *alg); void crypto_unregister_alg(struct crypto_alg *alg); int crypto_register_algs(struct crypto_alg *algs, int count); void crypto_unregister_algs(struct crypto_alg *algs, int count); void crypto_mod_put(struct crypto_alg *alg); int crypto_register_template(struct crypto_template *tmpl); int crypto_register_templates(struct crypto_template *tmpls, int count); void crypto_unregister_template(struct crypto_template *tmpl); void crypto_unregister_templates(struct crypto_template *tmpls, int count); struct crypto_template *crypto_lookup_template(const char *name); int crypto_register_instance(struct crypto_template *tmpl, struct crypto_instance *inst); void crypto_unregister_instance(struct crypto_instance *inst); int crypto_grab_spawn(struct crypto_spawn *spawn, struct crypto_instance *inst, const char *name, u32 type, u32 mask); void crypto_drop_spawn(struct crypto_spawn *spawn); struct crypto_tfm *crypto_spawn_tfm(struct crypto_spawn *spawn, u32 type, u32 mask); void *crypto_spawn_tfm2(struct crypto_spawn *spawn); struct crypto_attr_type *crypto_get_attr_type(struct rtattr **tb); int crypto_check_attr_type(struct rtattr **tb, u32 type, u32 *mask_ret); const char *crypto_attr_alg_name(struct rtattr *rta); int __crypto_inst_setname(struct crypto_instance *inst, const char *name, const char *driver, struct crypto_alg *alg); #define crypto_inst_setname(inst, name, ...) \ CONCATENATE(crypto_inst_setname_, COUNT_ARGS(__VA_ARGS__))( \ inst, name, ##__VA_ARGS__) #define crypto_inst_setname_1(inst, name, alg) \ __crypto_inst_setname(inst, name, name, alg) #define crypto_inst_setname_2(inst, name, driver, alg) \ __crypto_inst_setname(inst, name, driver, alg) void crypto_init_queue(struct crypto_queue *queue, unsigned int max_qlen); int crypto_enqueue_request(struct crypto_queue *queue, struct crypto_async_request *request); void crypto_enqueue_request_head(struct crypto_queue *queue, struct crypto_async_request *request); struct crypto_async_request *crypto_dequeue_request(struct crypto_queue *queue); static inline unsigned int crypto_queue_len(struct crypto_queue *queue) { return queue->qlen; } void crypto_inc(u8 *a, unsigned int size); static inline void *crypto_tfm_ctx(struct crypto_tfm *tfm) { return tfm->__crt_ctx; } static inline void *crypto_tfm_ctx_align(struct crypto_tfm *tfm, unsigned int align) { if (align <= crypto_tfm_ctx_alignment()) align = 1; return PTR_ALIGN(crypto_tfm_ctx(tfm), align); } static inline unsigned int crypto_dma_align(void) { return CRYPTO_DMA_ALIGN; } static inline unsigned int crypto_dma_padding(void) { return (crypto_dma_align() - 1) & ~(crypto_tfm_ctx_alignment() - 1); } static inline void *crypto_tfm_ctx_dma(struct crypto_tfm *tfm) { return crypto_tfm_ctx_align(tfm, crypto_dma_align()); } static inline struct crypto_instance *crypto_tfm_alg_instance( struct crypto_tfm *tfm) { return container_of(tfm->__crt_alg, struct crypto_instance, alg); } static inline void *crypto_instance_ctx(struct crypto_instance *inst) { return inst->__ctx; } static inline struct crypto_async_request *crypto_get_backlog( struct crypto_queue *queue) { return queue->backlog == &queue->list ? NULL : container_of(queue->backlog, struct crypto_async_request, list); } static inline u32 crypto_requires_off(struct crypto_attr_type *algt, u32 off) { return (algt->type ^ off) & algt->mask & off; } /* * When an algorithm uses another algorithm (e.g., if it's an instance of a * template), these are the flags that should always be set on the "outer" * algorithm if any "inner" algorithm has them set. */ #define CRYPTO_ALG_INHERITED_FLAGS \ (CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK | \ CRYPTO_ALG_ALLOCATES_MEMORY) /* * Given the type and mask that specify the flags restrictions on a template * instance being created, return the mask that should be passed to * crypto_grab_*() (along with type=0) to honor any request the user made to * have any of the CRYPTO_ALG_INHERITED_FLAGS clear. */ static inline u32 crypto_algt_inherited_mask(struct crypto_attr_type *algt) { return crypto_requires_off(algt, CRYPTO_ALG_INHERITED_FLAGS); } int crypto_register_notifier(struct notifier_block *nb); int crypto_unregister_notifier(struct notifier_block *nb); /* Crypto notification events. */ enum { CRYPTO_MSG_ALG_REQUEST, CRYPTO_MSG_ALG_REGISTER, CRYPTO_MSG_ALG_LOADED, }; static inline void crypto_request_complete(struct crypto_async_request *req, int err) { req->complete(req->data, err); } static inline u32 crypto_tfm_alg_type(struct crypto_tfm *tfm) { return tfm->__crt_alg->cra_flags & CRYPTO_ALG_TYPE_MASK; } static inline bool crypto_tfm_req_virt(struct crypto_tfm *tfm) { return tfm->__crt_alg->cra_flags & CRYPTO_ALG_REQ_VIRT; } static inline u32 crypto_request_flags(struct crypto_async_request *req) { return req->flags & ~CRYPTO_TFM_REQ_ON_STACK; } #endif /* _CRYPTO_ALGAPI_H */ |
| 73 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * Dynamic loading of modules into the kernel. * * Rewritten by Richard Henderson <rth@tamu.edu> Dec 1996 * Rewritten again by Rusty Russell, 2002 */ #ifndef _LINUX_MODULE_H #define _LINUX_MODULE_H #include <linux/list.h> #include <linux/stat.h> #include <linux/buildid.h> #include <linux/compiler.h> #include <linux/cache.h> #include <linux/cleanup.h> #include <linux/kmod.h> #include <linux/init.h> #include <linux/elf.h> #include <linux/stringify.h> #include <linux/kobject.h> #include <linux/moduleparam.h> #include <linux/jump_label.h> #include <linux/export.h> #include <linux/rbtree_latch.h> #include <linux/error-injection.h> #include <linux/tracepoint-defs.h> #include <linux/srcu.h> #include <linux/static_call_types.h> #include <linux/dynamic_debug.h> #include <linux/percpu.h> #include <asm/module.h> #define MODULE_NAME_LEN __MODULE_NAME_LEN struct modversion_info { unsigned long crc; char name[MODULE_NAME_LEN]; }; struct module; struct exception_table_entry; struct module_kobject { struct kobject kobj; struct module *mod; struct kobject *drivers_dir; struct module_param_attrs *mp; struct completion *kobj_completion; } __randomize_layout; struct module_attribute { struct attribute attr; ssize_t (*show)(const struct module_attribute *, struct module_kobject *, char *); ssize_t (*store)(const struct module_attribute *, struct module_kobject *, const char *, size_t count); void (*setup)(struct module *, const char *); int (*test)(struct module *); void (*free)(struct module *); }; struct module_version_attribute { struct module_attribute mattr; const char *module_name; const char *version; }; extern ssize_t __modver_version_show(const struct module_attribute *, struct module_kobject *, char *); extern const struct module_attribute module_uevent; /* These are either module local, or the kernel's dummy ones. */ extern int init_module(void); extern void cleanup_module(void); #ifndef MODULE /** * module_init() - driver initialization entry point * @x: function to be run at kernel boot time or module insertion * * module_init() will either be called during do_initcalls() (if * builtin) or at module insertion time (if a module). There can only * be one per module. */ #define module_init(x) __initcall(x); /** * module_exit() - driver exit entry point * @x: function to be run when driver is removed * * module_exit() will wrap the driver clean-up code * with cleanup_module() when used with rmmod when * the driver is a module. If the driver is statically * compiled into the kernel, module_exit() has no effect. * There can only be one per module. */ #define module_exit(x) __exitcall(x); #else /* MODULE */ /* * In most cases loadable modules do not need custom * initcall levels. There are still some valid cases where * a driver may be needed early if built in, and does not * matter when built as a loadable module. Like bus * snooping debug drivers. */ #define early_initcall(fn) module_init(fn) #define core_initcall(fn) module_init(fn) #define core_initcall_sync(fn) module_init(fn) #define postcore_initcall(fn) module_init(fn) #define postcore_initcall_sync(fn) module_init(fn) #define arch_initcall(fn) module_init(fn) #define subsys_initcall(fn) module_init(fn) #define subsys_initcall_sync(fn) module_init(fn) #define fs_initcall(fn) module_init(fn) #define fs_initcall_sync(fn) module_init(fn) #define rootfs_initcall(fn) module_init(fn) #define device_initcall(fn) module_init(fn) #define device_initcall_sync(fn) module_init(fn) #define late_initcall(fn) module_init(fn) #define late_initcall_sync(fn) module_init(fn) #define console_initcall(fn) module_init(fn) /* Each module must use one module_init(). */ #define module_init(initfn) \ static inline initcall_t __maybe_unused __inittest(void) \ { return initfn; } \ int init_module(void) __copy(initfn) \ __attribute__((alias(#initfn))); \ ___ADDRESSABLE(init_module, __initdata); /* This is only required if you want to be unloadable. */ #define module_exit(exitfn) \ static inline exitcall_t __maybe_unused __exittest(void) \ { return exitfn; } \ void cleanup_module(void) __copy(exitfn) \ __attribute__((alias(#exitfn))); \ ___ADDRESSABLE(cleanup_module, __exitdata); #endif /* This means "can be init if no module support, otherwise module load may call it." */ #ifdef CONFIG_MODULES #define __init_or_module #define __initdata_or_module #define __initconst_or_module #define __INIT_OR_MODULE .text #define __INITDATA_OR_MODULE .data #define __INITRODATA_OR_MODULE .section ".rodata","a",%progbits #else #define __init_or_module __init #define __initdata_or_module __initdata #define __initconst_or_module __initconst #define __INIT_OR_MODULE __INIT #define __INITDATA_OR_MODULE __INITDATA #define __INITRODATA_OR_MODULE __INITRODATA #endif /*CONFIG_MODULES*/ struct module_kobject *lookup_or_create_module_kobject(const char *name); /* For userspace: you can also call me... */ #define MODULE_ALIAS(_alias) MODULE_INFO(alias, _alias) /* Soft module dependencies. See man modprobe.d for details. * Example: MODULE_SOFTDEP("pre: module-foo module-bar post: module-baz") */ #define MODULE_SOFTDEP(_softdep) MODULE_INFO(softdep, _softdep) /* * Weak module dependencies. See man modprobe.d for details. * Example: MODULE_WEAKDEP("module-foo") */ #define MODULE_WEAKDEP(_weakdep) MODULE_INFO(weakdep, _weakdep) /* * MODULE_FILE is used for generating modules.builtin * So, make it no-op when this is being built as a module */ #ifdef MODULE #define MODULE_FILE #else #define MODULE_FILE MODULE_INFO(file, KBUILD_MODFILE); #endif /* * The following license idents are currently accepted as indicating free * software modules * * "GPL" [GNU Public License v2] * "GPL v2" [GNU Public License v2] * "GPL and additional rights" [GNU Public License v2 rights and more] * "Dual BSD/GPL" [GNU Public License v2 * or BSD license choice] * "Dual MIT/GPL" [GNU Public License v2 * or MIT license choice] * "Dual MPL/GPL" [GNU Public License v2 * or Mozilla license choice] * * The following other idents are available * * "Proprietary" [Non free products] * * Both "GPL v2" and "GPL" (the latter also in dual licensed strings) are * merely stating that the module is licensed under the GPL v2, but are not * telling whether "GPL v2 only" or "GPL v2 or later". The reason why there * are two variants is a historic and failed attempt to convey more * information in the MODULE_LICENSE string. For module loading the * "only/or later" distinction is completely irrelevant and does neither * replace the proper license identifiers in the corresponding source file * nor amends them in any way. The sole purpose is to make the * 'Proprietary' flagging work and to refuse to bind symbols which are * exported with EXPORT_SYMBOL_GPL when a non free module is loaded. * * In the same way "BSD" is not a clear license information. It merely * states, that the module is licensed under one of the compatible BSD * license variants. The detailed and correct license information is again * to be found in the corresponding source files. * * There are dual licensed components, but when running with Linux it is the * GPL that is relevant so this is a non issue. Similarly LGPL linked with GPL * is a GPL combined work. * * This exists for several reasons * 1. So modinfo can show license info for users wanting to vet their setup * is free * 2. So the community can ignore bug reports including proprietary modules * 3. So vendors can do likewise based on their own policies */ #define MODULE_LICENSE(_license) MODULE_FILE MODULE_INFO(license, _license) /* * Author(s), use "Name <email>" or just "Name", for multiple * authors use multiple MODULE_AUTHOR() statements/lines. */ #define MODULE_AUTHOR(_author) MODULE_INFO(author, _author) /* What your module does. */ #define MODULE_DESCRIPTION(_description) MODULE_INFO(description, _description) #ifdef MODULE /* Creates an alias so file2alias.c can find device table. */ #define MODULE_DEVICE_TABLE(type, name) \ static typeof(name) __mod_device_table__##type##__##name \ __attribute__ ((used, alias(__stringify(name)))) #else /* !MODULE */ #define MODULE_DEVICE_TABLE(type, name) #endif /* Version of form [<epoch>:]<version>[-<extra-version>]. * Or for CVS/RCS ID version, everything but the number is stripped. * <epoch>: A (small) unsigned integer which allows you to start versions * anew. If not mentioned, it's zero. eg. "2:1.0" is after * "1:2.0". * <version>: The <version> may contain only alphanumerics and the * character `.'. Ordered by numeric sort for numeric parts, * ascii sort for ascii parts (as per RPM or DEB algorithm). * <extraversion>: Like <version>, but inserted for local * customizations, eg "rh3" or "rusty1". * Using this automatically adds a checksum of the .c files and the * local headers in "srcversion". */ #if defined(MODULE) || !defined(CONFIG_SYSFS) #define MODULE_VERSION(_version) MODULE_INFO(version, _version) #else #define MODULE_VERSION(_version) \ MODULE_INFO(version, _version); \ static const struct module_version_attribute __modver_attr \ __used __section("__modver") \ __aligned(__alignof__(struct module_version_attribute)) \ = { \ .mattr = { \ .attr = { \ .name = "version", \ .mode = S_IRUGO, \ }, \ .show = __modver_version_show, \ }, \ .module_name = KBUILD_MODNAME, \ .version = _version, \ } #endif /* Optional firmware file (or files) needed by the module * format is simply firmware file name. Multiple firmware * files require multiple MODULE_FIRMWARE() specifiers */ #define MODULE_FIRMWARE(_firmware) MODULE_INFO(firmware, _firmware) #define MODULE_IMPORT_NS(ns) MODULE_INFO(import_ns, ns) struct notifier_block; enum module_state { MODULE_STATE_LIVE, /* Normal state. */ MODULE_STATE_COMING, /* Full formed, running module_init. */ MODULE_STATE_GOING, /* Going away. */ MODULE_STATE_UNFORMED, /* Still setting it up. */ }; struct mod_tree_node { struct module *mod; struct latch_tree_node node; }; enum mod_mem_type { MOD_TEXT = 0, MOD_DATA, MOD_RODATA, MOD_RO_AFTER_INIT, MOD_INIT_TEXT, MOD_INIT_DATA, MOD_INIT_RODATA, MOD_MEM_NUM_TYPES, MOD_INVALID = -1, }; #define mod_mem_type_is_init(type) \ ((type) == MOD_INIT_TEXT || \ (type) == MOD_INIT_DATA || \ (type) == MOD_INIT_RODATA) #define mod_mem_type_is_core(type) (!mod_mem_type_is_init(type)) #define mod_mem_type_is_text(type) \ ((type) == MOD_TEXT || \ (type) == MOD_INIT_TEXT) #define mod_mem_type_is_data(type) (!mod_mem_type_is_text(type)) #define mod_mem_type_is_core_data(type) \ (mod_mem_type_is_core(type) && \ mod_mem_type_is_data(type)) #define for_each_mod_mem_type(type) \ for (enum mod_mem_type (type) = 0; \ (type) < MOD_MEM_NUM_TYPES; (type)++) #define for_class_mod_mem_type(type, class) \ for_each_mod_mem_type(type) \ if (mod_mem_type_is_##class(type)) struct module_memory { void *base; bool is_rox; unsigned int size; #ifdef CONFIG_MODULES_TREE_LOOKUP struct mod_tree_node mtn; #endif }; #ifdef CONFIG_MODULES_TREE_LOOKUP /* Only touch one cacheline for common rbtree-for-core-layout case. */ #define __module_memory_align ____cacheline_aligned #else #define __module_memory_align #endif struct mod_kallsyms { Elf_Sym *symtab; unsigned int num_symtab; char *strtab; char *typetab; }; #ifdef CONFIG_LIVEPATCH /** * struct klp_modinfo - ELF information preserved from the livepatch module * * @hdr: ELF header * @sechdrs: Section header table * @secstrings: String table for the section headers * @symndx: The symbol table section index */ struct klp_modinfo { Elf_Ehdr hdr; Elf_Shdr *sechdrs; char *secstrings; unsigned int symndx; }; #endif struct module { enum module_state state; /* Member of list of modules */ struct list_head list; /* Unique handle for this module */ char name[MODULE_NAME_LEN]; #ifdef CONFIG_STACKTRACE_BUILD_ID /* Module build ID */ unsigned char build_id[BUILD_ID_SIZE_MAX]; #endif /* Sysfs stuff. */ struct module_kobject mkobj; struct module_attribute *modinfo_attrs; const char *version; const char *srcversion; struct kobject *holders_dir; /* Exported symbols */ const struct kernel_symbol *syms; const u32 *crcs; unsigned int num_syms; #ifdef CONFIG_ARCH_USES_CFI_TRAPS s32 *kcfi_traps; s32 *kcfi_traps_end; #endif /* Kernel parameters. */ #ifdef CONFIG_SYSFS struct mutex param_lock; #endif struct kernel_param *kp; unsigned int num_kp; /* GPL-only exported symbols. */ unsigned int num_gpl_syms; const struct kernel_symbol *gpl_syms; const u32 *gpl_crcs; bool using_gplonly_symbols; #ifdef CONFIG_MODULE_SIG /* Signature was verified. */ bool sig_ok; #endif bool async_probe_requested; /* Exception table */ unsigned int num_exentries; struct exception_table_entry *extable; /* Startup function. */ int (*init)(void); struct module_memory mem[MOD_MEM_NUM_TYPES] __module_memory_align; /* Arch-specific module values */ struct mod_arch_specific arch; unsigned long taints; /* same bits as kernel:taint_flags */ #ifdef CONFIG_GENERIC_BUG /* Support for BUG */ unsigned num_bugs; struct list_head bug_list; struct bug_entry *bug_table; #endif #ifdef CONFIG_KALLSYMS /* Protected by RCU and/or module_mutex: use rcu_dereference() */ struct mod_kallsyms __rcu *kallsyms; struct mod_kallsyms core_kallsyms; /* Section attributes */ struct module_sect_attrs *sect_attrs; /* Notes attributes */ struct module_notes_attrs *notes_attrs; #endif /* The command line arguments (may be mangled). People like keeping pointers to this stuff */ char *args; #ifdef CONFIG_SMP /* Per-cpu data. */ void __percpu *percpu; unsigned int percpu_size; #endif void *noinstr_text_start; unsigned int noinstr_text_size; #ifdef CONFIG_TRACEPOINTS unsigned int num_tracepoints; tracepoint_ptr_t *tracepoints_ptrs; #endif #ifdef CONFIG_TREE_SRCU unsigned int num_srcu_structs; struct srcu_struct **srcu_struct_ptrs; #endif #ifdef CONFIG_BPF_EVENTS unsigned int num_bpf_raw_events; struct bpf_raw_event_map *bpf_raw_events; #endif #ifdef CONFIG_DEBUG_INFO_BTF_MODULES unsigned int btf_data_size; unsigned int btf_base_data_size; void *btf_data; void *btf_base_data; #endif #ifdef CONFIG_JUMP_LABEL struct jump_entry *jump_entries; unsigned int num_jump_entries; #endif #ifdef CONFIG_TRACING unsigned int num_trace_bprintk_fmt; const char **trace_bprintk_fmt_start; #endif #ifdef CONFIG_EVENT_TRACING struct trace_event_call **trace_events; unsigned int num_trace_events; struct trace_eval_map **trace_evals; unsigned int num_trace_evals; #endif #ifdef CONFIG_DYNAMIC_FTRACE unsigned int num_ftrace_callsites; unsigned long *ftrace_callsites; #endif #ifdef CONFIG_KPROBES void *kprobes_text_start; unsigned int kprobes_text_size; unsigned long *kprobe_blacklist; unsigned int num_kprobe_blacklist; #endif #ifdef CONFIG_HAVE_STATIC_CALL_INLINE int num_static_call_sites; struct static_call_site *static_call_sites; #endif #if IS_ENABLED(CONFIG_KUNIT) int num_kunit_init_suites; struct kunit_suite **kunit_init_suites; int num_kunit_suites; struct kunit_suite **kunit_suites; #endif #ifdef CONFIG_LIVEPATCH bool klp; /* Is this a livepatch module? */ bool klp_alive; /* ELF information */ struct klp_modinfo *klp_info; #endif #ifdef CONFIG_PRINTK_INDEX unsigned int printk_index_size; struct pi_entry **printk_index_start; #endif #ifdef CONFIG_MODULE_UNLOAD /* What modules depend on me? */ struct list_head source_list; /* What modules do I depend on? */ struct list_head target_list; /* Destruction function. */ void (*exit)(void); atomic_t refcnt; #endif #ifdef CONFIG_CONSTRUCTORS /* Constructor functions. */ ctor_fn_t *ctors; unsigned int num_ctors; #endif #ifdef CONFIG_FUNCTION_ERROR_INJECTION struct error_injection_entry *ei_funcs; unsigned int num_ei_funcs; #endif #ifdef CONFIG_DYNAMIC_DEBUG_CORE struct _ddebug_info dyndbg_info; #endif } ____cacheline_aligned __randomize_layout; #ifndef MODULE_ARCH_INIT #define MODULE_ARCH_INIT {} #endif #ifdef CONFIG_MODULES /* Get/put a kernel symbol (calls must be symmetric) */ void *__symbol_get(const char *symbol); void *__symbol_get_gpl(const char *symbol); #define symbol_get(x) ({ \ static const char __notrim[] \ __used __section(".no_trim_symbol") = __stringify(x); \ (typeof(&x))(__symbol_get(__stringify(x))); }) #ifndef HAVE_ARCH_KALLSYMS_SYMBOL_VALUE static inline unsigned long kallsyms_symbol_value(const Elf_Sym *sym) { return sym->st_value; } #endif /* FIXME: It'd be nice to isolate modules during init, too, so they aren't used before they (may) fail. But presently too much code (IDE & SCSI) require entry into the module during init.*/ static inline bool module_is_live(struct module *mod) { return mod->state != MODULE_STATE_GOING; } static inline bool module_is_coming(struct module *mod) { return mod->state == MODULE_STATE_COMING; } struct module *__module_text_address(unsigned long addr); struct module *__module_address(unsigned long addr); bool is_module_address(unsigned long addr); bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr); bool is_module_percpu_address(unsigned long addr); bool is_module_text_address(unsigned long addr); static inline bool within_module_mem_type(unsigned long addr, const struct module *mod, enum mod_mem_type type) { unsigned long base, size; base = (unsigned long)mod->mem[type].base; size = mod->mem[type].size; return addr - base < size; } static inline bool within_module_core(unsigned long addr, const struct module *mod) { for_class_mod_mem_type(type, core) { if (within_module_mem_type(addr, mod, type)) return true; } return false; } static inline bool within_module_init(unsigned long addr, const struct module *mod) { for_class_mod_mem_type(type, init) { if (within_module_mem_type(addr, mod, type)) return true; } return false; } static inline bool within_module(unsigned long addr, const struct module *mod) { return within_module_init(addr, mod) || within_module_core(addr, mod); } /* Search for module by name: must be in a RCU critical section. */ struct module *find_module(const char *name); extern void __noreturn __module_put_and_kthread_exit(struct module *mod, long code); #define module_put_and_kthread_exit(code) __module_put_and_kthread_exit(THIS_MODULE, code) #ifdef CONFIG_MODULE_UNLOAD int module_refcount(struct module *mod); void __symbol_put(const char *symbol); #define symbol_put(x) __symbol_put(__stringify(x)) void symbol_put_addr(void *addr); /* Sometimes we know we already have a refcount, and it's easier not to handle the error case (which only happens with rmmod --wait). */ extern void __module_get(struct module *module); /** * try_module_get() - take module refcount unless module is being removed * @module: the module we should check for * * Only try to get a module reference count if the module is not being removed. * This call will fail if the module is in the process of being removed. * * Care must also be taken to ensure the module exists and is alive prior to * usage of this call. This can be gauranteed through two means: * * 1) Direct protection: you know an earlier caller must have increased the * module reference through __module_get(). This can typically be achieved * by having another entity other than the module itself increment the * module reference count. * * 2) Implied protection: there is an implied protection against module * removal. An example of this is the implied protection used by kernfs / * sysfs. The sysfs store / read file operations are guaranteed to exist * through the use of kernfs's active reference (see kernfs_active()) and a * sysfs / kernfs file removal cannot happen unless the same file is not * active. Therefore, if a sysfs file is being read or written to the module * which created it must still exist. It is therefore safe to use * try_module_get() on module sysfs store / read ops. * * One of the real values to try_module_get() is the module_is_live() check * which ensures that the caller of try_module_get() can yield to userspace * module removal requests and gracefully fail if the module is on its way out. * * Returns true if the reference count was successfully incremented. */ extern bool try_module_get(struct module *module); /** * module_put() - release a reference count to a module * @module: the module we should release a reference count for * * If you successfully bump a reference count to a module with try_module_get(), * when you are finished you must call module_put() to release that reference * count. */ extern void module_put(struct module *module); #else /*!CONFIG_MODULE_UNLOAD*/ static inline bool try_module_get(struct module *module) { return !module || module_is_live(module); } static inline void module_put(struct module *module) { } static inline void __module_get(struct module *module) { } #define symbol_put(x) do { } while (0) #define symbol_put_addr(p) do { } while (0) #endif /* CONFIG_MODULE_UNLOAD */ /* This is a #define so the string doesn't get put in every .o file */ #define module_name(mod) \ ({ \ struct module *__mod = (mod); \ __mod ? __mod->name : "kernel"; \ }) /* Dereference module function descriptor */ void *dereference_module_function_descriptor(struct module *mod, void *ptr); int register_module_notifier(struct notifier_block *nb); int unregister_module_notifier(struct notifier_block *nb); extern void print_modules(void); static inline bool module_requested_async_probing(struct module *module) { return module && module->async_probe_requested; } static inline bool is_livepatch_module(struct module *mod) { #ifdef CONFIG_LIVEPATCH return mod->klp; #else return false; #endif } void set_module_sig_enforced(void); void module_for_each_mod(int(*func)(struct module *mod, void *data), void *data); #else /* !CONFIG_MODULES... */ static inline struct module *__module_address(unsigned long addr) { return NULL; } static inline struct module *__module_text_address(unsigned long addr) { return NULL; } static inline bool is_module_address(unsigned long addr) { return false; } static inline bool is_module_percpu_address(unsigned long addr) { return false; } static inline bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr) { return false; } static inline bool is_module_text_address(unsigned long addr) { return false; } static inline bool within_module_core(unsigned long addr, const struct module *mod) { return false; } static inline bool within_module_init(unsigned long addr, const struct module *mod) { return false; } static inline bool within_module(unsigned long addr, const struct module *mod) { return false; } /* Get/put a kernel symbol (calls should be symmetric) */ #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak,visibility("hidden"))); &(x); }) #define symbol_put(x) do { } while (0) #define symbol_put_addr(x) do { } while (0) static inline void __module_get(struct module *module) { } static inline bool try_module_get(struct module *module) { return true; } static inline void module_put(struct module *module) { } #define module_name(mod) "kernel" static inline int register_module_notifier(struct notifier_block *nb) { /* no events will happen anyway, so this can always succeed */ return 0; } static inline int unregister_module_notifier(struct notifier_block *nb) { return 0; } #define module_put_and_kthread_exit(code) kthread_exit(code) static inline void print_modules(void) { } static inline bool module_requested_async_probing(struct module *module) { return false; } static inline void set_module_sig_enforced(void) { } /* Dereference module function descriptor */ static inline void *dereference_module_function_descriptor(struct module *mod, void *ptr) { return ptr; } static inline bool module_is_coming(struct module *mod) { return false; } static inline void module_for_each_mod(int(*func)(struct module *mod, void *data), void *data) { } #endif /* CONFIG_MODULES */ #ifdef CONFIG_SYSFS extern struct kset *module_kset; extern const struct kobj_type module_ktype; #endif /* CONFIG_SYSFS */ #define symbol_request(x) try_then_request_module(symbol_get(x), "symbol:" #x) /* BELOW HERE ALL THESE ARE OBSOLETE AND WILL VANISH */ #define __MODULE_STRING(x) __stringify(x) #ifdef CONFIG_GENERIC_BUG void module_bug_finalize(const Elf_Ehdr *, const Elf_Shdr *, struct module *); void module_bug_cleanup(struct module *); #else /* !CONFIG_GENERIC_BUG */ static inline void module_bug_finalize(const Elf_Ehdr *hdr, const Elf_Shdr *sechdrs, struct module *mod) { } static inline void module_bug_cleanup(struct module *mod) {} #endif /* CONFIG_GENERIC_BUG */ #ifdef CONFIG_MITIGATION_RETPOLINE extern bool retpoline_module_ok(bool has_retpoline); #else static inline bool retpoline_module_ok(bool has_retpoline) { return true; } #endif #ifdef CONFIG_MODULE_SIG bool is_module_sig_enforced(void); static inline bool module_sig_ok(struct module *module) { return module->sig_ok; } #else /* !CONFIG_MODULE_SIG */ static inline bool is_module_sig_enforced(void) { return false; } static inline bool module_sig_ok(struct module *module) { return true; } #endif /* CONFIG_MODULE_SIG */ #if defined(CONFIG_MODULES) && defined(CONFIG_KALLSYMS) int module_kallsyms_on_each_symbol(const char *modname, int (*fn)(void *, const char *, unsigned long), void *data); /* For kallsyms to ask for address resolution. namebuf should be at * least KSYM_NAME_LEN long: a pointer to namebuf is returned if * found, otherwise NULL. */ int module_address_lookup(unsigned long addr, unsigned long *symbolsize, unsigned long *offset, char **modname, const unsigned char **modbuildid, char *namebuf); int lookup_module_symbol_name(unsigned long addr, char *symname); int lookup_module_symbol_attrs(unsigned long addr, unsigned long *size, unsigned long *offset, char *modname, char *name); /* Returns 0 and fills in value, defined and namebuf, or -ERANGE if * symnum out of range. */ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *name, char *module_name, int *exported); /* Look for this name: can be of form module:name. */ unsigned long module_kallsyms_lookup_name(const char *name); unsigned long find_kallsyms_symbol_value(struct module *mod, const char *name); #else /* CONFIG_MODULES && CONFIG_KALLSYMS */ static inline int module_kallsyms_on_each_symbol(const char *modname, int (*fn)(void *, const char *, unsigned long), void *data) { return -EOPNOTSUPP; } /* For kallsyms to ask for address resolution. NULL means not found. */ static inline int module_address_lookup(unsigned long addr, unsigned long *symbolsize, unsigned long *offset, char **modname, const unsigned char **modbuildid, char *namebuf) { return 0; } static inline int lookup_module_symbol_name(unsigned long addr, char *symname) { return -ERANGE; } static inline int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *name, char *module_name, int *exported) { return -ERANGE; } static inline unsigned long module_kallsyms_lookup_name(const char *name) { return 0; } static inline unsigned long find_kallsyms_symbol_value(struct module *mod, const char *name) { return 0; } #endif /* CONFIG_MODULES && CONFIG_KALLSYMS */ /* Define __free(module_put) macro for struct module *. */ DEFINE_FREE(module_put, struct module *, if (_T) module_put(_T)) #endif /* _LINUX_MODULE_H */ |
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 | /* SPDX-License-Identifier: GPL-2.0-or-later */ #ifndef __CPUSET_INTERNAL_H #define __CPUSET_INTERNAL_H #include <linux/cgroup.h> #include <linux/cpu.h> #include <linux/cpumask.h> #include <linux/cpuset.h> #include <linux/spinlock.h> #include <linux/union_find.h> /* See "Frequency meter" comments, below. */ struct fmeter { int cnt; /* unprocessed events count */ int val; /* most recent output value */ time64_t time; /* clock (secs) when val computed */ spinlock_t lock; /* guards read or write of above */ }; /* * Invalid partition error code */ enum prs_errcode { PERR_NONE = 0, PERR_INVCPUS, PERR_INVPARENT, PERR_NOTPART, PERR_NOTEXCL, PERR_NOCPUS, PERR_HOTPLUG, PERR_CPUSEMPTY, PERR_HKEEPING, PERR_ACCESS, PERR_REMOTE, }; /* bits in struct cpuset flags field */ typedef enum { CS_ONLINE, CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE, CS_MEM_HARDWALL, CS_MEMORY_MIGRATE, CS_SCHED_LOAD_BALANCE, CS_SPREAD_PAGE, CS_SPREAD_SLAB, } cpuset_flagbits_t; /* The various types of files and directories in a cpuset file system */ typedef enum { FILE_MEMORY_MIGRATE, FILE_CPULIST, FILE_MEMLIST, FILE_EFFECTIVE_CPULIST, FILE_EFFECTIVE_MEMLIST, FILE_SUBPARTS_CPULIST, FILE_EXCLUSIVE_CPULIST, FILE_EFFECTIVE_XCPULIST, FILE_ISOLATED_CPULIST, FILE_CPU_EXCLUSIVE, FILE_MEM_EXCLUSIVE, FILE_MEM_HARDWALL, FILE_SCHED_LOAD_BALANCE, FILE_PARTITION_ROOT, FILE_SCHED_RELAX_DOMAIN_LEVEL, FILE_MEMORY_PRESSURE_ENABLED, FILE_MEMORY_PRESSURE, FILE_SPREAD_PAGE, FILE_SPREAD_SLAB, } cpuset_filetype_t; struct cpuset { struct cgroup_subsys_state css; unsigned long flags; /* "unsigned long" so bitops work */ /* * On default hierarchy: * * The user-configured masks can only be changed by writing to * cpuset.cpus and cpuset.mems, and won't be limited by the * parent masks. * * The effective masks is the real masks that apply to the tasks * in the cpuset. They may be changed if the configured masks are * changed or hotplug happens. * * effective_mask == configured_mask & parent's effective_mask, * and if it ends up empty, it will inherit the parent's mask. * * * On legacy hierarchy: * * The user-configured masks are always the same with effective masks. */ /* user-configured CPUs and Memory Nodes allow to tasks */ cpumask_var_t cpus_allowed; nodemask_t mems_allowed; /* effective CPUs and Memory Nodes allow to tasks */ cpumask_var_t effective_cpus; nodemask_t effective_mems; /* * Exclusive CPUs dedicated to current cgroup (default hierarchy only) * * The effective_cpus of a valid partition root comes solely from its * effective_xcpus and some of the effective_xcpus may be distributed * to sub-partitions below & hence excluded from its effective_cpus. * For a valid partition root, its effective_cpus have no relationship * with cpus_allowed unless its exclusive_cpus isn't set. * * This value will only be set if either exclusive_cpus is set or * when this cpuset becomes a local partition root. */ cpumask_var_t effective_xcpus; /* * Exclusive CPUs as requested by the user (default hierarchy only) * * Its value is independent of cpus_allowed and designates the set of * CPUs that can be granted to the current cpuset or its children when * it becomes a valid partition root. The effective set of exclusive * CPUs granted (effective_xcpus) depends on whether those exclusive * CPUs are passed down by its ancestors and not yet taken up by * another sibling partition root along the way. * * If its value isn't set, it defaults to cpus_allowed. */ cpumask_var_t exclusive_cpus; /* * This is old Memory Nodes tasks took on. * * - top_cpuset.old_mems_allowed is initialized to mems_allowed. * - A new cpuset's old_mems_allowed is initialized when some * task is moved into it. * - old_mems_allowed is used in cpuset_migrate_mm() when we change * cpuset.mems_allowed and have tasks' nodemask updated, and * then old_mems_allowed is updated to mems_allowed. */ nodemask_t old_mems_allowed; struct fmeter fmeter; /* memory_pressure filter */ /* * Tasks are being attached to this cpuset. Used to prevent * zeroing cpus/mems_allowed between ->can_attach() and ->attach(). */ int attach_in_progress; /* for custom sched domain */ int relax_domain_level; /* number of valid local child partitions */ int nr_subparts; /* partition root state */ int partition_root_state; /* * number of SCHED_DEADLINE tasks attached to this cpuset, so that we * know when to rebuild associated root domain bandwidth information. */ int nr_deadline_tasks; int nr_migrate_dl_tasks; u64 sum_migrate_dl_bw; /* Invalid partition error code, not lock protected */ enum prs_errcode prs_err; /* Handle for cpuset.cpus.partition */ struct cgroup_file partition_file; /* Remote partition silbling list anchored at remote_children */ struct list_head remote_sibling; /* Used to merge intersecting subsets for generate_sched_domains */ struct uf_node node; }; static inline struct cpuset *css_cs(struct cgroup_subsys_state *css) { return css ? container_of(css, struct cpuset, css) : NULL; } /* Retrieve the cpuset for a task */ static inline struct cpuset *task_cs(struct task_struct *task) { return css_cs(task_css(task, cpuset_cgrp_id)); } static inline struct cpuset *parent_cs(struct cpuset *cs) { return css_cs(cs->css.parent); } /* convenient tests for these bits */ static inline bool is_cpuset_online(struct cpuset *cs) { return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css); } static inline int is_cpu_exclusive(const struct cpuset *cs) { return test_bit(CS_CPU_EXCLUSIVE, &cs->flags); } static inline int is_mem_exclusive(const struct cpuset *cs) { return test_bit(CS_MEM_EXCLUSIVE, &cs->flags); } static inline int is_mem_hardwall(const struct cpuset *cs) { return test_bit(CS_MEM_HARDWALL, &cs->flags); } static inline int is_sched_load_balance(const struct cpuset *cs) { return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags); } static inline int is_memory_migrate(const struct cpuset *cs) { return test_bit(CS_MEMORY_MIGRATE, &cs->flags); } static inline int is_spread_page(const struct cpuset *cs) { return test_bit(CS_SPREAD_PAGE, &cs->flags); } static inline int is_spread_slab(const struct cpuset *cs) { return test_bit(CS_SPREAD_SLAB, &cs->flags); } /** * cpuset_for_each_child - traverse online children of a cpuset * @child_cs: loop cursor pointing to the current child * @pos_css: used for iteration * @parent_cs: target cpuset to walk children of * * Walk @child_cs through the online children of @parent_cs. Must be used * with RCU read locked. */ #define cpuset_for_each_child(child_cs, pos_css, parent_cs) \ css_for_each_child((pos_css), &(parent_cs)->css) \ if (is_cpuset_online(((child_cs) = css_cs((pos_css))))) /** * cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants * @des_cs: loop cursor pointing to the current descendant * @pos_css: used for iteration * @root_cs: target cpuset to walk ancestor of * * Walk @des_cs through the online descendants of @root_cs. Must be used * with RCU read locked. The caller may modify @pos_css by calling * css_rightmost_descendant() to skip subtree. @root_cs is included in the * iteration and the first node to be visited. */ #define cpuset_for_each_descendant_pre(des_cs, pos_css, root_cs) \ css_for_each_descendant_pre((pos_css), &(root_cs)->css) \ if (is_cpuset_online(((des_cs) = css_cs((pos_css))))) void rebuild_sched_domains_locked(void); void cpuset_callback_lock_irq(void); void cpuset_callback_unlock_irq(void); void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus); void cpuset_update_tasks_nodemask(struct cpuset *cs); int cpuset_update_flag(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on); ssize_t cpuset_write_resmask(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off); int cpuset_common_seq_show(struct seq_file *sf, void *v); /* * cpuset-v1.c */ #ifdef CONFIG_CPUSETS_V1 extern struct cftype cpuset1_files[]; void fmeter_init(struct fmeter *fmp); void cpuset1_update_task_spread_flags(struct cpuset *cs, struct task_struct *tsk); void cpuset1_update_tasks_flags(struct cpuset *cs); void cpuset1_hotplug_update_tasks(struct cpuset *cs, struct cpumask *new_cpus, nodemask_t *new_mems, bool cpus_updated, bool mems_updated); int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial); #else static inline void fmeter_init(struct fmeter *fmp) {} static inline void cpuset1_update_task_spread_flags(struct cpuset *cs, struct task_struct *tsk) {} static inline void cpuset1_update_tasks_flags(struct cpuset *cs) {} static inline void cpuset1_hotplug_update_tasks(struct cpuset *cs, struct cpumask *new_cpus, nodemask_t *new_mems, bool cpus_updated, bool mems_updated) {} static inline int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial) { return 0; } #endif /* CONFIG_CPUSETS_V1 */ #endif /* __CPUSET_INTERNAL_H */ |
| 15 4 15 22 3 16 3 1 2 16 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 | // SPDX-License-Identifier: GPL-2.0-or-later /* * IP Payload Compression Protocol (IPComp) for IPv6 - RFC3173 * * Copyright (C)2003 USAGI/WIDE Project * * Author Mitsuru KANDA <mk@linux-ipv6.org> */ /* * [Memo] * * Outbound: * The compression of IP datagram MUST be done before AH/ESP processing, * fragmentation, and the addition of Hop-by-Hop/Routing header. * * Inbound: * The decompression of IP datagram MUST be done after the reassembly, * AH/ESP processing. */ #define pr_fmt(fmt) "IPv6: " fmt #include <linux/module.h> #include <net/ip.h> #include <net/xfrm.h> #include <net/ipcomp.h> #include <linux/crypto.h> #include <linux/err.h> #include <linux/pfkeyv2.h> #include <linux/random.h> #include <linux/percpu.h> #include <linux/smp.h> #include <linux/list.h> #include <linux/vmalloc.h> #include <linux/rtnetlink.h> #include <net/ip6_route.h> #include <net/icmp.h> #include <net/ipv6.h> #include <net/protocol.h> #include <linux/ipv6.h> #include <linux/icmpv6.h> #include <linux/mutex.h> static int ipcomp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, u8 type, u8 code, int offset, __be32 info) { struct net *net = dev_net(skb->dev); __be32 spi; const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data; struct ip_comp_hdr *ipcomph = (struct ip_comp_hdr *)(skb->data + offset); struct xfrm_state *x; if (type != ICMPV6_PKT_TOOBIG && type != NDISC_REDIRECT) return 0; spi = htonl(ntohs(ipcomph->cpi)); x = xfrm_state_lookup(net, skb->mark, (const xfrm_address_t *)&iph->daddr, spi, IPPROTO_COMP, AF_INET6); if (!x) return 0; if (type == NDISC_REDIRECT) ip6_redirect(skb, net, skb->dev->ifindex, 0, sock_net_uid(net, NULL)); else ip6_update_pmtu(skb, net, info, 0, 0, sock_net_uid(net, NULL)); xfrm_state_put(x); return 0; } static struct lock_class_key xfrm_state_lock_key; static struct xfrm_state *ipcomp6_tunnel_create(struct xfrm_state *x) { struct net *net = xs_net(x); struct xfrm_state *t = NULL; t = xfrm_state_alloc(net); if (!t) goto out; lockdep_set_class(&t->lock, &xfrm_state_lock_key); t->id.proto = IPPROTO_IPV6; t->id.spi = xfrm6_tunnel_alloc_spi(net, (xfrm_address_t *)&x->props.saddr); if (!t->id.spi) goto error; memcpy(t->id.daddr.a6, x->id.daddr.a6, sizeof(struct in6_addr)); memcpy(&t->sel, &x->sel, sizeof(t->sel)); t->props.family = AF_INET6; t->props.mode = x->props.mode; memcpy(t->props.saddr.a6, x->props.saddr.a6, sizeof(struct in6_addr)); memcpy(&t->mark, &x->mark, sizeof(t->mark)); t->if_id = x->if_id; if (xfrm_init_state(t)) goto error; atomic_set(&t->tunnel_users, 1); out: return t; error: t->km.state = XFRM_STATE_DEAD; xfrm_state_put(t); t = NULL; goto out; } static int ipcomp6_tunnel_attach(struct xfrm_state *x) { struct net *net = xs_net(x); int err = 0; struct xfrm_state *t = NULL; __be32 spi; u32 mark = x->mark.m & x->mark.v; spi = xfrm6_tunnel_spi_lookup(net, (xfrm_address_t *)&x->props.saddr); if (spi) t = xfrm_state_lookup(net, mark, (xfrm_address_t *)&x->id.daddr, spi, IPPROTO_IPV6, AF_INET6); if (!t) { t = ipcomp6_tunnel_create(x); if (!t) { err = -EINVAL; goto out; } xfrm_state_insert(t); xfrm_state_hold(t); } x->tunnel = t; atomic_inc(&t->tunnel_users); out: return err; } static int ipcomp6_init_state(struct xfrm_state *x, struct netlink_ext_ack *extack) { int err = -EINVAL; x->props.header_len = 0; switch (x->props.mode) { case XFRM_MODE_TRANSPORT: break; case XFRM_MODE_TUNNEL: x->props.header_len += sizeof(struct ipv6hdr); break; default: NL_SET_ERR_MSG(extack, "Unsupported XFRM mode for IPcomp"); goto out; } err = ipcomp_init_state(x, extack); if (err) goto out; if (x->props.mode == XFRM_MODE_TUNNEL) { err = ipcomp6_tunnel_attach(x); if (err) { NL_SET_ERR_MSG(extack, "Kernel error: failed to initialize the associated state"); goto out; } } err = 0; out: return err; } static int ipcomp6_rcv_cb(struct sk_buff *skb, int err) { return 0; } static const struct xfrm_type ipcomp6_type = { .owner = THIS_MODULE, .proto = IPPROTO_COMP, .init_state = ipcomp6_init_state, .destructor = ipcomp_destroy, .input = ipcomp_input, .output = ipcomp_output, }; static struct xfrm6_protocol ipcomp6_protocol = { .handler = xfrm6_rcv, .input_handler = xfrm_input, .cb_handler = ipcomp6_rcv_cb, .err_handler = ipcomp6_err, .priority = 0, }; static int __init ipcomp6_init(void) { if (xfrm_register_type(&ipcomp6_type, AF_INET6) < 0) { pr_info("%s: can't add xfrm type\n", __func__); return -EAGAIN; } if (xfrm6_protocol_register(&ipcomp6_protocol, IPPROTO_COMP) < 0) { pr_info("%s: can't add protocol\n", __func__); xfrm_unregister_type(&ipcomp6_type, AF_INET6); return -EAGAIN; } return 0; } static void __exit ipcomp6_fini(void) { if (xfrm6_protocol_deregister(&ipcomp6_protocol, IPPROTO_COMP) < 0) pr_info("%s: can't remove protocol\n", __func__); xfrm_unregister_type(&ipcomp6_type, AF_INET6); } module_init(ipcomp6_init); module_exit(ipcomp6_fini); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("IP Payload Compression Protocol (IPComp) for IPv6 - RFC3173"); MODULE_AUTHOR("Mitsuru KANDA <mk@linux-ipv6.org>"); MODULE_ALIAS_XFRM_TYPE(AF_INET6, XFRM_PROTO_COMP); |
| 4582 4582 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_SYNC_CORE_H #define _ASM_X86_SYNC_CORE_H #include <linux/preempt.h> #include <asm/processor.h> #include <asm/cpufeature.h> #include <asm/special_insns.h> #ifdef CONFIG_X86_32 static __always_inline void iret_to_self(void) { asm volatile ( "pushfl\n\t" "pushl %%cs\n\t" "pushl $1f\n\t" "iret\n\t" "1:" : ASM_CALL_CONSTRAINT : : "memory"); } #else static __always_inline void iret_to_self(void) { unsigned int tmp; asm volatile ( "mov %%ss, %0\n\t" "pushq %q0\n\t" "pushq %%rsp\n\t" "addq $8, (%%rsp)\n\t" "pushfq\n\t" "mov %%cs, %0\n\t" "pushq %q0\n\t" "pushq $1f\n\t" "iretq\n\t" "1:" : "=&r" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory"); } #endif /* CONFIG_X86_32 */ /* * This function forces the icache and prefetched instruction stream to * catch up with reality in two very specific cases: * * a) Text was modified using one virtual address and is about to be executed * from the same physical page at a different virtual address. * * b) Text was modified on a different CPU, may subsequently be * executed on this CPU, and you want to make sure the new version * gets executed. This generally means you're calling this in an IPI. * * If you're calling this for a different reason, you're probably doing * it wrong. * * Like all of Linux's memory ordering operations, this is a * compiler barrier as well. */ static __always_inline void sync_core(void) { /* * The SERIALIZE instruction is the most straightforward way to * do this, but it is not universally available. */ if (static_cpu_has(X86_FEATURE_SERIALIZE)) { serialize(); return; } /* * For all other processors, there are quite a few ways to do this. * IRET-to-self is nice because it works on every CPU, at any CPL * (so it's compatible with paravirtualization), and it never exits * to a hypervisor. The only downsides are that it's a bit slow * (it seems to be a bit more than 2x slower than the fastest * options) and that it unmasks NMIs. The "push %cs" is needed, * because in paravirtual environments __KERNEL_CS may not be a * valid CS value when we do IRET directly. * * In case NMI unmasking or performance ever becomes a problem, * the next best option appears to be MOV-to-CR2 and an * unconditional jump. That sequence also works on all CPUs, * but it will fault at CPL3 (i.e. Xen PV). * * CPUID is the conventional way, but it's nasty: it doesn't * exist on some 486-like CPUs, and it usually exits to a * hypervisor. */ iret_to_self(); } /* * Ensure that a core serializing instruction is issued before returning * to user-mode. x86 implements return to user-space through sysexit, * sysrel, and sysretq, which are not core serializing. */ static inline void sync_core_before_usermode(void) { /* With PTI, we unconditionally serialize before running user code. */ if (static_cpu_has(X86_FEATURE_PTI)) return; /* * Even if we're in an interrupt, we might reschedule before returning, * in which case we could switch to a different thread in the same mm * and return using SYSRET or SYSEXIT. Instead of trying to keep * track of our need to sync the core, just sync right away. */ sync_core(); } #endif /* _ASM_X86_SYNC_CORE_H */ |
| 20 1 173 478 9 50 175 136 219 133 12 10 39 313 177 9 164 23 20 20 162 162 112 176 49 41 30 9 41 213 211 2 200 89 90 90 4 4 172 24 169 169 24 117 175 174 173 66 11 13 13 8 348 348 347 346 104 105 105 105 85 204 204 203 3 203 1 1 133 133 3 133 88 12 4 13 11 176 174 8 6 2 8 8 8 192 8 8 194 124 120 192 318 113 94 4 171 1 246 246 246 8 3 3 46 46 30 14 14 51 51 51 51 51 51 51 51 67 67 152 152 152 152 47 219 219 162 152 152 28 100 100 2 98 98 93 28 3 77 86 86 85 70 50 52 48 21 50 50 43 11 38 18 49 48 50 50 49 93 6 47 47 44 47 11 178 158 6 16 109 3 1 2 169 226 6 7 140 104 203 301 298 1 11 2 10 46 297 149 301 96 3 9 7 1 17 17 14 39 39 22 25 7 6 1 16 36 25 5 4 4 3 1 2 2 1 4 1 3 4 1 6 1 3 59 1 2 2 1 6 10 45 2 21 11 10 1 1 2 3 7 2 1 1 3 1 21 30 17 14 11 4 1 1 3 1 4 7 2 3 1 1 4 3 1 3 2 133 25 13 11 17 9 9 297 298 299 156 155 5 3 13 41 128 237 15 1 185 206 192 1 223 221 188 181 194 174 194 187 152 35 5 15 12 4 146 4 4 12 10 6 3 154 21 9 148 6 7 56 5 13 8 14 9 1 3 3 1 1 171 170 17 6 1 7 3 15 17 9 3 12 5 5 10 6 8 14 2 10 12 7 10 17 80 4 82 2 82 67 19 25 11 13 5 6 8 10 2 21 16 59 6 2 1 2 11 11 6 5 12 3 16 11 5 39 133 34 4 36 248 166 193 27 4 4 1 4 11 56 11 143 10 174 162 59 28 144 56 172 9 28 10 2 11 5 4 153 21 203 298 107 200 205 180 171 195 277 190 187 211 180 196 293 200 142 205 186 10 119 162 265 177 3 177 120 272 234 186 221 195 204 272 266 187 267 142 141 271 101 268 200 271 8 271 255 42 272 107 215 203 271 162 173 270 53 300 239 126 221 299 188 253 253 254 272 272 299 4 7 1 2 77 77 1 3 45 8 2 1 2 2 2 2 1 1 3 2 1 2 224 262 320 19 19 27 27 343 342 216 216 216 219 19 18 18 1 152 152 152 230 550 2 2 4 4 2 1 2 2 2 2 2 4 4 4 1 1 3 1 2 3 3 3 62 63 31 53 206 6 204 6 1 5 5 5 5 5 5 2 2 2 156 156 156 19 4 8 14 3 19 14 3 2 64 1 2 1 4 20 37 6 1 1 3 1 3 94 89 65 23 5 1 43 35 22 17 15 11 11 11 11 11 13 11 12 16 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 1991, 1992 Linus Torvalds */ /* * Hopefully this will be a rather complete VT102 implementation. * * Beeping thanks to John T Kohl. * * Virtual Consoles, Screen Blanking, Screen Dumping, Color, Graphics * Chars, and VT100 enhancements by Peter MacDonald. * * Copy and paste function by Andrew Haylett, * some enhancements by Alessandro Rubini. * * Code to check for different video-cards mostly by Galen Hunt, * <g-hunt@ee.utah.edu> * * Rudimentary ISO 10646/Unicode/UTF-8 character set support by * Markus Kuhn, <mskuhn@immd4.informatik.uni-erlangen.de>. * * Dynamic allocation of consoles, aeb@cwi.nl, May 1994 * Resizing of consoles, aeb, 940926 * * Code for xterm like mouse click reporting by Peter Orbaek 20-Jul-94 * <poe@daimi.aau.dk> * * User-defined bell sound, new setterm control sequences and printk * redirection by Martin Mares <mj@k332.feld.cvut.cz> 19-Nov-95 * * APM screenblank bug fixed Takashi Manabe <manabe@roy.dsl.tutics.tut.jp> * * Merge with the abstract console driver by Geert Uytterhoeven * <geert@linux-m68k.org>, Jan 1997. * * Original m68k console driver modifications by * * - Arno Griffioen <arno@usn.nl> * - David Carter <carter@cs.bris.ac.uk> * * The abstract console driver provides a generic interface for a text * console. It supports VGA text mode, frame buffer based graphical consoles * and special graphics processors that are only accessible through some * registers (e.g. a TMS340x0 GSP). * * The interface to the hardware is specified using a special structure * (struct consw) which contains function pointers to console operations * (see <linux/console.h> for more information). * * Support for changeable cursor shape * by Pavel Machek <pavel@atrey.karlin.mff.cuni.cz>, August 1997 * * Ported to i386 and con_scrolldelta fixed * by Emmanuel Marty <core@ggi-project.org>, April 1998 * * Resurrected character buffers in videoram plus lots of other trickery * by Martin Mares <mj@atrey.karlin.mff.cuni.cz>, July 1998 * * Removed old-style timers, introduced console_timer, made timer * deletion SMP-safe. 17Jun00, Andrew Morton * * Removed console_lock, enabled interrupts across all console operations * 13 March 2001, Andrew Morton * * Fixed UTF-8 mode so alternate charset modes always work according * to control sequences interpreted in do_con_trol function * preserving backward VT100 semigraphics compatibility, * malformed UTF sequences represented as sequences of replacement glyphs, * original codes or '?' as a last resort if replacement glyph is undefined * by Adam Tla/lka <atlka@pg.gda.pl>, Aug 2006 */ #include <linux/module.h> #include <linux/types.h> #include <linux/sched/signal.h> #include <linux/tty.h> #include <linux/tty_flip.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/errno.h> #include <linux/kd.h> #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/major.h> #include <linux/mm.h> #include <linux/console.h> #include <linux/init.h> #include <linux/mutex.h> #include <linux/vt_kern.h> #include <linux/selection.h> #include <linux/tiocl.h> #include <linux/kbd_kern.h> #include <linux/consolemap.h> #include <linux/timer.h> #include <linux/interrupt.h> #include <linux/workqueue.h> #include <linux/pm.h> #include <linux/font.h> #include <linux/bitops.h> #include <linux/notifier.h> #include <linux/device.h> #include <linux/io.h> #include <linux/uaccess.h> #include <linux/kdb.h> #include <linux/ctype.h> #include <linux/gcd.h> #define MAX_NR_CON_DRIVER 16 #define CON_DRIVER_FLAG_MODULE 1 #define CON_DRIVER_FLAG_INIT 2 #define CON_DRIVER_FLAG_ATTR 4 #define CON_DRIVER_FLAG_ZOMBIE 8 struct con_driver { const struct consw *con; const char *desc; struct device *dev; int node; int first; int last; int flag; }; static struct con_driver registered_con_driver[MAX_NR_CON_DRIVER]; const struct consw *conswitchp; /* * Here is the default bell parameters: 750HZ, 1/8th of a second */ #define DEFAULT_BELL_PITCH 750 #define DEFAULT_BELL_DURATION (HZ/8) #define DEFAULT_CURSOR_BLINK_MS 200 struct vc vc_cons [MAX_NR_CONSOLES]; EXPORT_SYMBOL(vc_cons); static const struct consw *con_driver_map[MAX_NR_CONSOLES]; static int con_open(struct tty_struct *, struct file *); static void vc_init(struct vc_data *vc, int do_clear); static void gotoxy(struct vc_data *vc, int new_x, int new_y); static void save_cur(struct vc_data *vc); static void reset_terminal(struct vc_data *vc, int do_clear); static void con_flush_chars(struct tty_struct *tty); static int set_vesa_blanking(u8 __user *mode); static void set_cursor(struct vc_data *vc); static void hide_cursor(struct vc_data *vc); static void console_callback(struct work_struct *ignored); static void con_driver_unregister_callback(struct work_struct *ignored); static void blank_screen_t(struct timer_list *unused); static void set_palette(struct vc_data *vc); static void unblank_screen(void); #define vt_get_kmsg_redirect() vt_kmsg_redirect(-1) int default_utf8 = true; module_param(default_utf8, int, S_IRUGO | S_IWUSR); int global_cursor_default = -1; module_param(global_cursor_default, int, S_IRUGO | S_IWUSR); EXPORT_SYMBOL(global_cursor_default); static int cur_default = CUR_UNDERLINE; module_param(cur_default, int, S_IRUGO | S_IWUSR); /* * ignore_poke: don't unblank the screen when things are typed. This is * mainly for the privacy of braille terminal users. */ static int ignore_poke; int do_poke_blanked_console; int console_blanked; EXPORT_SYMBOL(console_blanked); static enum vesa_blank_mode vesa_blank_mode; static int vesa_off_interval; static int blankinterval; core_param(consoleblank, blankinterval, int, 0444); static DECLARE_WORK(console_work, console_callback); static DECLARE_WORK(con_driver_unregister_work, con_driver_unregister_callback); /* * fg_console is the current virtual console, * last_console is the last used one, * want_console is the console we want to switch to, * saved_* variants are for save/restore around kernel debugger enter/leave */ int fg_console; EXPORT_SYMBOL(fg_console); int last_console; int want_console = -1; static int saved_fg_console; static int saved_last_console; static int saved_want_console; static int saved_vc_mode; static int saved_console_blanked; /* * For each existing display, we have a pointer to console currently visible * on that display, allowing consoles other than fg_console to be refreshed * appropriately. Unless the low-level driver supplies its own display_fg * variable, we use this one for the "master display". */ static struct vc_data *master_display_fg; /* * Unfortunately, we need to delay tty echo when we're currently writing to the * console since the code is (and always was) not re-entrant, so we schedule * all flip requests to process context with schedule-task() and run it from * console_callback(). */ /* * For the same reason, we defer scrollback to the console callback. */ static int scrollback_delta; /* * Hook so that the power management routines can (un)blank * the console on our behalf. */ int (*console_blank_hook)(int); EXPORT_SYMBOL(console_blank_hook); static DEFINE_TIMER(console_timer, blank_screen_t); static int blank_state; static int blank_timer_expired; enum { blank_off = 0, blank_normal_wait, blank_vesa_wait, }; /* * /sys/class/tty/tty0/ * * the attribute 'active' contains the name of the current vc * console and it supports poll() to detect vc switches */ static struct device *tty0dev; /* * Notifier list for console events. */ static ATOMIC_NOTIFIER_HEAD(vt_notifier_list); int register_vt_notifier(struct notifier_block *nb) { return atomic_notifier_chain_register(&vt_notifier_list, nb); } EXPORT_SYMBOL_GPL(register_vt_notifier); int unregister_vt_notifier(struct notifier_block *nb) { return atomic_notifier_chain_unregister(&vt_notifier_list, nb); } EXPORT_SYMBOL_GPL(unregister_vt_notifier); static void notify_write(struct vc_data *vc, unsigned int unicode) { struct vt_notifier_param param = { .vc = vc, .c = unicode }; atomic_notifier_call_chain(&vt_notifier_list, VT_WRITE, ¶m); } static void notify_update(struct vc_data *vc) { struct vt_notifier_param param = { .vc = vc }; atomic_notifier_call_chain(&vt_notifier_list, VT_UPDATE, ¶m); } /* * Low-Level Functions */ static inline bool con_is_fg(const struct vc_data *vc) { return vc->vc_num == fg_console; } static inline bool con_should_update(const struct vc_data *vc) { return con_is_visible(vc) && !console_blanked; } static inline u16 *screenpos(const struct vc_data *vc, unsigned int offset, bool viewed) { unsigned long origin = viewed ? vc->vc_visible_origin : vc->vc_origin; return (u16 *)(origin + offset); } static void con_putc(struct vc_data *vc, u16 ca, unsigned int y, unsigned int x) { if (vc->vc_sw->con_putc) vc->vc_sw->con_putc(vc, ca, y, x); else vc->vc_sw->con_putcs(vc, &ca, 1, y, x); } /* Called from the keyboard irq path.. */ static inline void scrolldelta(int lines) { /* FIXME */ /* scrolldelta needs some kind of consistency lock, but the BKL was and still is not protecting versus the scheduled back end */ scrollback_delta += lines; schedule_console_callback(); } void schedule_console_callback(void) { schedule_work(&console_work); } /* * Code to manage unicode-based screen buffers */ /* * Our screen buffer is preceded by an array of line pointers so that * scrolling only implies some pointer shuffling. */ static u32 **vc_uniscr_alloc(unsigned int cols, unsigned int rows) { u32 **uni_lines; void *p; unsigned int memsize, i, col_size = cols * sizeof(**uni_lines); /* allocate everything in one go */ memsize = col_size * rows; memsize += rows * sizeof(*uni_lines); uni_lines = vzalloc(memsize); if (!uni_lines) return NULL; /* initial line pointers */ p = uni_lines + rows; for (i = 0; i < rows; i++) { uni_lines[i] = p; p += col_size; } return uni_lines; } static void vc_uniscr_free(u32 **uni_lines) { vfree(uni_lines); } static void vc_uniscr_set(struct vc_data *vc, u32 **new_uni_lines) { vc_uniscr_free(vc->vc_uni_lines); vc->vc_uni_lines = new_uni_lines; } static void vc_uniscr_putc(struct vc_data *vc, u32 uc) { if (vc->vc_uni_lines) vc->vc_uni_lines[vc->state.y][vc->state.x] = uc; } static void vc_uniscr_insert(struct vc_data *vc, unsigned int nr) { if (vc->vc_uni_lines) { u32 *ln = vc->vc_uni_lines[vc->state.y]; unsigned int x = vc->state.x, cols = vc->vc_cols; memmove(&ln[x + nr], &ln[x], (cols - x - nr) * sizeof(*ln)); memset32(&ln[x], ' ', nr); } } static void vc_uniscr_delete(struct vc_data *vc, unsigned int nr) { if (vc->vc_uni_lines) { u32 *ln = vc->vc_uni_lines[vc->state.y]; unsigned int x = vc->state.x, cols = vc->vc_cols; memmove(&ln[x], &ln[x + nr], (cols - x - nr) * sizeof(*ln)); memset32(&ln[cols - nr], ' ', nr); } } static void vc_uniscr_clear_line(struct vc_data *vc, unsigned int x, unsigned int nr) { if (vc->vc_uni_lines) memset32(&vc->vc_uni_lines[vc->state.y][x], ' ', nr); } static void vc_uniscr_clear_lines(struct vc_data *vc, unsigned int y, unsigned int nr) { if (vc->vc_uni_lines) while (nr--) memset32(vc->vc_uni_lines[y++], ' ', vc->vc_cols); } /* juggling array rotation algorithm (complexity O(N), size complexity O(1)) */ static void juggle_array(u32 **array, unsigned int size, unsigned int nr) { unsigned int gcd_idx; for (gcd_idx = 0; gcd_idx < gcd(nr, size); gcd_idx++) { u32 *gcd_idx_val = array[gcd_idx]; unsigned int dst_idx = gcd_idx; while (1) { unsigned int src_idx = (dst_idx + nr) % size; if (src_idx == gcd_idx) break; array[dst_idx] = array[src_idx]; dst_idx = src_idx; } array[dst_idx] = gcd_idx_val; } } static void vc_uniscr_scroll(struct vc_data *vc, unsigned int top, unsigned int bottom, enum con_scroll dir, unsigned int nr) { u32 **uni_lines = vc->vc_uni_lines; unsigned int size = bottom - top; if (!uni_lines) return; if (dir == SM_DOWN) { juggle_array(&uni_lines[top], size, size - nr); vc_uniscr_clear_lines(vc, top, nr); } else { juggle_array(&uni_lines[top], size, nr); vc_uniscr_clear_lines(vc, bottom - nr, nr); } } static u32 vc_uniscr_getc(struct vc_data *vc, int relative_pos) { int pos = vc->state.x + vc->vc_need_wrap + relative_pos; if (vc->vc_uni_lines && in_range(pos, 0, vc->vc_cols)) return vc->vc_uni_lines[vc->state.y][pos]; return 0; } static void vc_uniscr_copy_area(u32 **dst_lines, unsigned int dst_cols, unsigned int dst_rows, u32 **src_lines, unsigned int src_cols, unsigned int src_top_row, unsigned int src_bot_row) { unsigned int dst_row = 0; if (!dst_lines) return; while (src_top_row < src_bot_row) { u32 *src_line = src_lines[src_top_row]; u32 *dst_line = dst_lines[dst_row]; memcpy(dst_line, src_line, src_cols * sizeof(*src_line)); if (dst_cols - src_cols) memset32(dst_line + src_cols, ' ', dst_cols - src_cols); src_top_row++; dst_row++; } while (dst_row < dst_rows) { u32 *dst_line = dst_lines[dst_row]; memset32(dst_line, ' ', dst_cols); dst_row++; } } /* * Called from vcs_read() to make sure unicode screen retrieval is possible. * This will initialize the unicode screen buffer if not already done. * This returns 0 if OK, or a negative error code otherwise. * In particular, -ENODATA is returned if the console is not in UTF-8 mode. */ int vc_uniscr_check(struct vc_data *vc) { u32 **uni_lines; unsigned short *p; int x, y, mask; WARN_CONSOLE_UNLOCKED(); if (!vc->vc_utf) return -ENODATA; if (vc->vc_uni_lines) return 0; uni_lines = vc_uniscr_alloc(vc->vc_cols, vc->vc_rows); if (!uni_lines) return -ENOMEM; /* * Let's populate it initially with (imperfect) reverse translation. * This is the next best thing we can do short of having it enabled * from the start even when no users rely on this functionality. True * unicode content will be available after a complete screen refresh. */ p = (unsigned short *)vc->vc_origin; mask = vc->vc_hi_font_mask | 0xff; for (y = 0; y < vc->vc_rows; y++) { u32 *line = uni_lines[y]; for (x = 0; x < vc->vc_cols; x++) { u16 glyph = scr_readw(p++) & mask; line[x] = inverse_translate(vc, glyph, true); } } vc->vc_uni_lines = uni_lines; return 0; } /* * Called from vcs_read() to get the unicode data from the screen. * This must be preceded by a successful call to vc_uniscr_check() once * the console lock has been taken. */ void vc_uniscr_copy_line(const struct vc_data *vc, void *dest, bool viewed, unsigned int row, unsigned int col, unsigned int nr) { u32 **uni_lines = vc->vc_uni_lines; int offset = row * vc->vc_size_row + col * 2; unsigned long pos; if (WARN_ON_ONCE(!uni_lines)) return; pos = (unsigned long)screenpos(vc, offset, viewed); if (pos >= vc->vc_origin && pos < vc->vc_scr_end) { /* * Desired position falls in the main screen buffer. * However the actual row/col might be different if * scrollback is active. */ row = (pos - vc->vc_origin) / vc->vc_size_row; col = ((pos - vc->vc_origin) % vc->vc_size_row) / 2; memcpy(dest, &uni_lines[row][col], nr * sizeof(u32)); } else { /* * Scrollback is active. For now let's simply backtranslate * the screen glyphs until the unicode screen buffer does * synchronize with console display drivers for a scrollback * buffer of its own. */ u16 *p = (u16 *)pos; int mask = vc->vc_hi_font_mask | 0xff; u32 *uni_buf = dest; while (nr--) { u16 glyph = scr_readw(p++) & mask; *uni_buf++ = inverse_translate(vc, glyph, true); } } } static void con_scroll(struct vc_data *vc, unsigned int top, unsigned int bottom, enum con_scroll dir, unsigned int nr) { unsigned int rows = bottom - top; u16 *clear, *dst, *src; if (top + nr >= bottom) nr = rows - 1; if (bottom > vc->vc_rows || top >= bottom || nr < 1) return; vc_uniscr_scroll(vc, top, bottom, dir, nr); if (con_is_visible(vc) && vc->vc_sw->con_scroll(vc, top, bottom, dir, nr)) return; src = clear = (u16 *)(vc->vc_origin + vc->vc_size_row * top); dst = (u16 *)(vc->vc_origin + vc->vc_size_row * (top + nr)); if (dir == SM_UP) { clear = src + (rows - nr) * vc->vc_cols; swap(src, dst); } scr_memmovew(dst, src, (rows - nr) * vc->vc_size_row); scr_memsetw(clear, vc->vc_video_erase_char, vc->vc_size_row * nr); } static void do_update_region(struct vc_data *vc, unsigned long start, int count) { unsigned int xx, yy, offset; u16 *p = (u16 *)start; offset = (start - vc->vc_origin) / 2; xx = offset % vc->vc_cols; yy = offset / vc->vc_cols; for(;;) { u16 attrib = scr_readw(p) & 0xff00; int startx = xx; u16 *q = p; while (xx < vc->vc_cols && count) { if (attrib != (scr_readw(p) & 0xff00)) { if (p > q) vc->vc_sw->con_putcs(vc, q, p-q, yy, startx); startx = xx; q = p; attrib = scr_readw(p) & 0xff00; } p++; xx++; count--; } if (p > q) vc->vc_sw->con_putcs(vc, q, p-q, yy, startx); if (!count) break; xx = 0; yy++; } } void update_region(struct vc_data *vc, unsigned long start, int count) { WARN_CONSOLE_UNLOCKED(); if (con_should_update(vc)) { hide_cursor(vc); do_update_region(vc, start, count); set_cursor(vc); } } EXPORT_SYMBOL(update_region); /* Structure of attributes is hardware-dependent */ static u8 build_attr(struct vc_data *vc, u8 _color, enum vc_intensity _intensity, bool _blink, bool _underline, bool _reverse, bool _italic) { if (vc->vc_sw->con_build_attr) return vc->vc_sw->con_build_attr(vc, _color, _intensity, _blink, _underline, _reverse, _italic); /* * ++roman: I completely changed the attribute format for monochrome * mode (!can_do_color). The formerly used MDA (monochrome display * adapter) format didn't allow the combination of certain effects. * Now the attribute is just a bit vector: * Bit 0..1: intensity (0..2) * Bit 2 : underline * Bit 3 : reverse * Bit 7 : blink */ { u8 a = _color; if (!vc->vc_can_do_color) return _intensity | (_italic << 1) | (_underline << 2) | (_reverse << 3) | (_blink << 7); if (_italic) a = (a & 0xF0) | vc->vc_itcolor; else if (_underline) a = (a & 0xf0) | vc->vc_ulcolor; else if (_intensity == VCI_HALF_BRIGHT) a = (a & 0xf0) | vc->vc_halfcolor; if (_reverse) a = (a & 0x88) | (((a >> 4) | (a << 4)) & 0x77); if (_blink) a ^= 0x80; if (_intensity == VCI_BOLD) a ^= 0x08; if (vc->vc_hi_font_mask == 0x100) a <<= 1; return a; } } static void update_attr(struct vc_data *vc) { vc->vc_attr = build_attr(vc, vc->state.color, vc->state.intensity, vc->state.blink, vc->state.underline, vc->state.reverse ^ vc->vc_decscnm, vc->state.italic); vc->vc_video_erase_char = ' ' | (build_attr(vc, vc->state.color, VCI_NORMAL, vc->state.blink, false, vc->vc_decscnm, false) << 8); } /* Note: inverting the screen twice should revert to the original state */ void invert_screen(struct vc_data *vc, int offset, int count, bool viewed) { u16 *p; WARN_CONSOLE_UNLOCKED(); count /= 2; p = screenpos(vc, offset, viewed); if (vc->vc_sw->con_invert_region) { vc->vc_sw->con_invert_region(vc, p, count); } else { u16 *q = p; int cnt = count; u16 a; if (!vc->vc_can_do_color) { while (cnt--) { a = scr_readw(q); a ^= 0x0800; scr_writew(a, q); q++; } } else if (vc->vc_hi_font_mask == 0x100) { while (cnt--) { a = scr_readw(q); a = (a & 0x11ff) | ((a & 0xe000) >> 4) | ((a & 0x0e00) << 4); scr_writew(a, q); q++; } } else { while (cnt--) { a = scr_readw(q); a = (a & 0x88ff) | ((a & 0x7000) >> 4) | ((a & 0x0700) << 4); scr_writew(a, q); q++; } } } if (con_should_update(vc)) do_update_region(vc, (unsigned long) p, count); notify_update(vc); } /* used by selection: complement pointer position */ void complement_pos(struct vc_data *vc, int offset) { static int old_offset = -1; static unsigned short old; static unsigned short oldx, oldy; WARN_CONSOLE_UNLOCKED(); if (old_offset != -1 && old_offset >= 0 && old_offset < vc->vc_screenbuf_size) { scr_writew(old, screenpos(vc, old_offset, true)); if (con_should_update(vc)) con_putc(vc, old, oldy, oldx); notify_update(vc); } old_offset = offset; if (offset != -1 && offset >= 0 && offset < vc->vc_screenbuf_size) { unsigned short new; u16 *p = screenpos(vc, offset, true); old = scr_readw(p); new = old ^ vc->vc_complement_mask; scr_writew(new, p); if (con_should_update(vc)) { oldx = (offset >> 1) % vc->vc_cols; oldy = (offset >> 1) / vc->vc_cols; con_putc(vc, new, oldy, oldx); } notify_update(vc); } } static void insert_char(struct vc_data *vc, unsigned int nr) { unsigned short *p = (unsigned short *) vc->vc_pos; vc_uniscr_insert(vc, nr); scr_memmovew(p + nr, p, (vc->vc_cols - vc->state.x - nr) * 2); scr_memsetw(p, vc->vc_video_erase_char, nr * 2); vc->vc_need_wrap = 0; if (con_should_update(vc)) do_update_region(vc, (unsigned long) p, vc->vc_cols - vc->state.x); } static void delete_char(struct vc_data *vc, unsigned int nr) { unsigned short *p = (unsigned short *) vc->vc_pos; vc_uniscr_delete(vc, nr); scr_memmovew(p, p + nr, (vc->vc_cols - vc->state.x - nr) * 2); scr_memsetw(p + vc->vc_cols - vc->state.x - nr, vc->vc_video_erase_char, nr * 2); vc->vc_need_wrap = 0; if (con_should_update(vc)) do_update_region(vc, (unsigned long) p, vc->vc_cols - vc->state.x); } static int softcursor_original = -1; static void add_softcursor(struct vc_data *vc) { int i = scr_readw((u16 *) vc->vc_pos); u32 type = vc->vc_cursor_type; if (!(type & CUR_SW)) return; if (softcursor_original != -1) return; softcursor_original = i; i |= CUR_SET(type); i ^= CUR_CHANGE(type); if ((type & CUR_ALWAYS_BG) && (softcursor_original & CUR_BG) == (i & CUR_BG)) i ^= CUR_BG; if ((type & CUR_INVERT_FG_BG) && (i & CUR_FG) == ((i & CUR_BG) >> 4)) i ^= CUR_FG; scr_writew(i, (u16 *)vc->vc_pos); if (con_should_update(vc)) con_putc(vc, i, vc->state.y, vc->state.x); } static void hide_softcursor(struct vc_data *vc) { if (softcursor_original != -1) { scr_writew(softcursor_original, (u16 *)vc->vc_pos); if (con_should_update(vc)) con_putc(vc, softcursor_original, vc->state.y, vc->state.x); softcursor_original = -1; } } static void hide_cursor(struct vc_data *vc) { if (vc_is_sel(vc)) clear_selection(); vc->vc_sw->con_cursor(vc, false); hide_softcursor(vc); } static void set_cursor(struct vc_data *vc) { if (!con_is_fg(vc) || console_blanked || vc->vc_mode == KD_GRAPHICS) return; if (vc->vc_deccm) { if (vc_is_sel(vc)) clear_selection(); add_softcursor(vc); if (CUR_SIZE(vc->vc_cursor_type) != CUR_NONE) vc->vc_sw->con_cursor(vc, true); } else hide_cursor(vc); } static void set_origin(struct vc_data *vc) { WARN_CONSOLE_UNLOCKED(); if (!con_is_visible(vc) || !vc->vc_sw->con_set_origin || !vc->vc_sw->con_set_origin(vc)) vc->vc_origin = (unsigned long)vc->vc_screenbuf; vc->vc_visible_origin = vc->vc_origin; vc->vc_scr_end = vc->vc_origin + vc->vc_screenbuf_size; vc->vc_pos = vc->vc_origin + vc->vc_size_row * vc->state.y + 2 * vc->state.x; } static void save_screen(struct vc_data *vc) { WARN_CONSOLE_UNLOCKED(); if (vc->vc_sw->con_save_screen) vc->vc_sw->con_save_screen(vc); } static void flush_scrollback(struct vc_data *vc) { WARN_CONSOLE_UNLOCKED(); set_origin(vc); if (!con_is_visible(vc)) return; /* * The legacy way for flushing the scrollback buffer is to use a side * effect of the con_switch method. We do it only on the foreground * console as background consoles have no scrollback buffers in that * case and we obviously don't want to switch to them. */ hide_cursor(vc); vc->vc_sw->con_switch(vc); set_cursor(vc); } /* * Redrawing of screen */ void clear_buffer_attributes(struct vc_data *vc) { unsigned short *p = (unsigned short *)vc->vc_origin; int count = vc->vc_screenbuf_size / 2; int mask = vc->vc_hi_font_mask | 0xff; for (; count > 0; count--, p++) { scr_writew((scr_readw(p)&mask) | (vc->vc_video_erase_char & ~mask), p); } } void redraw_screen(struct vc_data *vc, int is_switch) { int redraw = 0; WARN_CONSOLE_UNLOCKED(); if (!vc) { /* strange ... */ /* printk("redraw_screen: tty %d not allocated ??\n", new_console+1); */ return; } if (is_switch) { struct vc_data *old_vc = vc_cons[fg_console].d; if (old_vc == vc) return; if (!con_is_visible(vc)) redraw = 1; *vc->vc_display_fg = vc; fg_console = vc->vc_num; hide_cursor(old_vc); if (!con_is_visible(old_vc)) { save_screen(old_vc); set_origin(old_vc); } if (tty0dev) sysfs_notify(&tty0dev->kobj, NULL, "active"); } else { hide_cursor(vc); redraw = 1; } if (redraw) { bool update; int old_was_color = vc->vc_can_do_color; set_origin(vc); update = vc->vc_sw->con_switch(vc); set_palette(vc); /* * If console changed from mono<->color, the best we can do * is to clear the buffer attributes. As it currently stands, * rebuilding new attributes from the old buffer is not doable * without overly complex code. */ if (old_was_color != vc->vc_can_do_color) { update_attr(vc); clear_buffer_attributes(vc); } if (update && vc->vc_mode != KD_GRAPHICS) do_update_region(vc, vc->vc_origin, vc->vc_screenbuf_size / 2); } set_cursor(vc); if (is_switch) { vt_set_leds_compute_shiftstate(); notify_update(vc); } } EXPORT_SYMBOL(redraw_screen); /* * Allocation, freeing and resizing of VTs. */ int vc_cons_allocated(unsigned int i) { return (i < MAX_NR_CONSOLES && vc_cons[i].d); } static void visual_init(struct vc_data *vc, int num, bool init) { /* ++Geert: vc->vc_sw->con_init determines console size */ if (vc->vc_sw) module_put(vc->vc_sw->owner); vc->vc_sw = conswitchp; if (con_driver_map[num]) vc->vc_sw = con_driver_map[num]; __module_get(vc->vc_sw->owner); vc->vc_num = num; vc->vc_display_fg = &master_display_fg; if (vc->uni_pagedict_loc) con_free_unimap(vc); vc->uni_pagedict_loc = &vc->uni_pagedict; vc->uni_pagedict = NULL; vc->vc_hi_font_mask = 0; vc->vc_complement_mask = 0; vc->vc_can_do_color = 0; vc->vc_cur_blink_ms = DEFAULT_CURSOR_BLINK_MS; vc->vc_sw->con_init(vc, init); if (!vc->vc_complement_mask) vc->vc_complement_mask = vc->vc_can_do_color ? 0x7700 : 0x0800; vc->vc_s_complement_mask = vc->vc_complement_mask; vc->vc_size_row = vc->vc_cols << 1; vc->vc_screenbuf_size = vc->vc_rows * vc->vc_size_row; } static void visual_deinit(struct vc_data *vc) { vc->vc_sw->con_deinit(vc); module_put(vc->vc_sw->owner); } static void vc_port_destruct(struct tty_port *port) { struct vc_data *vc = container_of(port, struct vc_data, port); kfree(vc); } static const struct tty_port_operations vc_port_ops = { .destruct = vc_port_destruct, }; /* * Change # of rows and columns (0 means unchanged/the size of fg_console) * [this is to be used together with some user program * like resize that changes the hardware videomode] */ #define VC_MAXCOL (32767) #define VC_MAXROW (32767) int vc_allocate(unsigned int currcons) /* return 0 on success */ { struct vt_notifier_param param; struct vc_data *vc; int err; WARN_CONSOLE_UNLOCKED(); if (currcons >= MAX_NR_CONSOLES) return -ENXIO; if (vc_cons[currcons].d) return 0; /* due to the granularity of kmalloc, we waste some memory here */ /* the alloc is done in two steps, to optimize the common situation of a 25x80 console (structsize=216, screenbuf_size=4000) */ /* although the numbers above are not valid since long ago, the point is still up-to-date and the comment still has its value even if only as a historical artifact. --mj, July 1998 */ param.vc = vc = kzalloc(sizeof(struct vc_data), GFP_KERNEL); if (!vc) return -ENOMEM; vc_cons[currcons].d = vc; tty_port_init(&vc->port); vc->port.ops = &vc_port_ops; INIT_WORK(&vc_cons[currcons].SAK_work, vc_SAK); visual_init(vc, currcons, true); if (!*vc->uni_pagedict_loc) con_set_default_unimap(vc); err = -EINVAL; if (vc->vc_cols > VC_MAXCOL || vc->vc_rows > VC_MAXROW || vc->vc_screenbuf_size > KMALLOC_MAX_SIZE || !vc->vc_screenbuf_size) goto err_free; err = -ENOMEM; vc->vc_screenbuf = kzalloc(vc->vc_screenbuf_size, GFP_KERNEL); if (!vc->vc_screenbuf) goto err_free; /* If no drivers have overridden us and the user didn't pass a boot option, default to displaying the cursor */ if (global_cursor_default == -1) global_cursor_default = 1; vc_init(vc, 1); vcs_make_sysfs(currcons); atomic_notifier_call_chain(&vt_notifier_list, VT_ALLOCATE, ¶m); return 0; err_free: visual_deinit(vc); kfree(vc); vc_cons[currcons].d = NULL; return err; } static inline int resize_screen(struct vc_data *vc, int width, int height, bool from_user) { /* Resizes the resolution of the display adapater */ int err = 0; if (vc->vc_sw->con_resize) err = vc->vc_sw->con_resize(vc, width, height, from_user); return err; } /** * vc_do_resize - resizing method for the tty * @tty: tty being resized * @vc: virtual console private data * @cols: columns * @lines: lines * @from_user: invoked by a user? * * Resize a virtual console, clipping according to the actual constraints. If * the caller passes a tty structure then update the termios winsize * information and perform any necessary signal handling. * * Locking: Caller must hold the console semaphore. Takes the termios rwsem and * ctrl.lock of the tty IFF a tty is passed. */ static int vc_do_resize(struct tty_struct *tty, struct vc_data *vc, unsigned int cols, unsigned int lines, bool from_user) { unsigned long old_origin, new_origin, new_scr_end, rlth, rrem, err = 0; unsigned long end; unsigned int old_rows, old_row_size, first_copied_row; unsigned int new_cols, new_rows, new_row_size, new_screen_size; unsigned short *oldscreen, *newscreen; u32 **new_uniscr = NULL; WARN_CONSOLE_UNLOCKED(); if (cols > VC_MAXCOL || lines > VC_MAXROW) return -EINVAL; new_cols = (cols ? cols : vc->vc_cols); new_rows = (lines ? lines : vc->vc_rows); new_row_size = new_cols << 1; new_screen_size = new_row_size * new_rows; if (new_cols == vc->vc_cols && new_rows == vc->vc_rows) { /* * This function is being called here to cover the case * where the userspace calls the FBIOPUT_VSCREENINFO twice, * passing the same fb_var_screeninfo containing the fields * yres/xres equal to a number non-multiple of vc_font.height * and yres_virtual/xres_virtual equal to number lesser than the * vc_font.height and yres/xres. * In the second call, the struct fb_var_screeninfo isn't * being modified by the underlying driver because of the * if above, and this causes the fbcon_display->vrows to become * negative and it eventually leads to out-of-bound * access by the imageblit function. * To give the correct values to the struct and to not have * to deal with possible errors from the code below, we call * the resize_screen here as well. */ return resize_screen(vc, new_cols, new_rows, from_user); } if (new_screen_size > KMALLOC_MAX_SIZE || !new_screen_size) return -EINVAL; newscreen = kzalloc(new_screen_size, GFP_USER); if (!newscreen) return -ENOMEM; if (vc->vc_uni_lines) { new_uniscr = vc_uniscr_alloc(new_cols, new_rows); if (!new_uniscr) { kfree(newscreen); return -ENOMEM; } } if (vc_is_sel(vc)) clear_selection(); old_rows = vc->vc_rows; old_row_size = vc->vc_size_row; err = resize_screen(vc, new_cols, new_rows, from_user); if (err) { kfree(newscreen); vc_uniscr_free(new_uniscr); return err; } vc->vc_rows = new_rows; vc->vc_cols = new_cols; vc->vc_size_row = new_row_size; vc->vc_screenbuf_size = new_screen_size; rlth = min(old_row_size, new_row_size); rrem = new_row_size - rlth; old_origin = vc->vc_origin; new_origin = (long) newscreen; new_scr_end = new_origin + new_screen_size; if (vc->state.y > new_rows) { if (old_rows - vc->state.y < new_rows) { /* * Cursor near the bottom, copy contents from the * bottom of buffer */ first_copied_row = (old_rows - new_rows); } else { /* * Cursor is in no man's land, copy 1/2 screenful * from the top and bottom of cursor position */ first_copied_row = (vc->state.y - new_rows/2); } old_origin += first_copied_row * old_row_size; } else first_copied_row = 0; end = old_origin + old_row_size * min(old_rows, new_rows); vc_uniscr_copy_area(new_uniscr, new_cols, new_rows, vc->vc_uni_lines, rlth/2, first_copied_row, min(old_rows, new_rows)); vc_uniscr_set(vc, new_uniscr); update_attr(vc); while (old_origin < end) { scr_memcpyw((unsigned short *) new_origin, (unsigned short *) old_origin, rlth); if (rrem) scr_memsetw((void *)(new_origin + rlth), vc->vc_video_erase_char, rrem); old_origin += old_row_size; new_origin += new_row_size; } if (new_scr_end > new_origin) scr_memsetw((void *)new_origin, vc->vc_video_erase_char, new_scr_end - new_origin); oldscreen = vc->vc_screenbuf; vc->vc_screenbuf = newscreen; vc->vc_screenbuf_size = new_screen_size; set_origin(vc); kfree(oldscreen); /* do part of a reset_terminal() */ vc->vc_top = 0; vc->vc_bottom = vc->vc_rows; gotoxy(vc, vc->state.x, vc->state.y); save_cur(vc); if (tty) { /* Rewrite the requested winsize data with the actual resulting sizes */ struct winsize ws; memset(&ws, 0, sizeof(ws)); ws.ws_row = vc->vc_rows; ws.ws_col = vc->vc_cols; ws.ws_ypixel = vc->vc_scan_lines; tty_do_resize(tty, &ws); } if (con_is_visible(vc)) update_screen(vc); vt_event_post(VT_EVENT_RESIZE, vc->vc_num, vc->vc_num); notify_update(vc); return err; } /** * __vc_resize - resize a VT * @vc: virtual console * @cols: columns * @rows: rows * @from_user: invoked by a user? * * Resize a virtual console as seen from the console end of things. We use the * common vc_do_resize() method to update the structures. * * Locking: The caller must hold the console sem to protect console internals * and @vc->port.tty. */ int __vc_resize(struct vc_data *vc, unsigned int cols, unsigned int rows, bool from_user) { return vc_do_resize(vc->port.tty, vc, cols, rows, from_user); } EXPORT_SYMBOL(__vc_resize); /** * vt_resize - resize a VT * @tty: tty to resize * @ws: winsize attributes * * Resize a virtual terminal. This is called by the tty layer as we register * our own handler for resizing. The mutual helper does all the actual work. * * Locking: Takes the console sem and the called methods then take the tty * termios_rwsem and the tty ctrl.lock in that order. */ static int vt_resize(struct tty_struct *tty, struct winsize *ws) { struct vc_data *vc = tty->driver_data; int ret; console_lock(); ret = vc_do_resize(tty, vc, ws->ws_col, ws->ws_row, false); console_unlock(); return ret; } struct vc_data *vc_deallocate(unsigned int currcons) { struct vc_data *vc = NULL; WARN_CONSOLE_UNLOCKED(); if (vc_cons_allocated(currcons)) { struct vt_notifier_param param; param.vc = vc = vc_cons[currcons].d; atomic_notifier_call_chain(&vt_notifier_list, VT_DEALLOCATE, ¶m); vcs_remove_sysfs(currcons); visual_deinit(vc); con_free_unimap(vc); put_pid(vc->vt_pid); vc_uniscr_set(vc, NULL); kfree(vc->vc_screenbuf); vc_cons[currcons].d = NULL; } return vc; } /* * VT102 emulator */ enum { EPecma = 0, EPdec, EPeq, EPgt, EPlt}; #define set_kbd(vc, x) vt_set_kbd_mode_bit((vc)->vc_num, (x)) #define clr_kbd(vc, x) vt_clr_kbd_mode_bit((vc)->vc_num, (x)) #define is_kbd(vc, x) vt_get_kbd_mode_bit((vc)->vc_num, (x)) #define decarm VC_REPEAT #define decckm VC_CKMODE #define kbdapplic VC_APPLIC #define lnm VC_CRLF const unsigned char color_table[] = { 0, 4, 2, 6, 1, 5, 3, 7, 8,12,10,14, 9,13,11,15 }; EXPORT_SYMBOL(color_table); /* the default colour table, for VGA+ colour systems */ unsigned char default_red[] = { 0x00, 0xaa, 0x00, 0xaa, 0x00, 0xaa, 0x00, 0xaa, 0x55, 0xff, 0x55, 0xff, 0x55, 0xff, 0x55, 0xff }; module_param_array(default_red, byte, NULL, S_IRUGO | S_IWUSR); EXPORT_SYMBOL(default_red); unsigned char default_grn[] = { 0x00, 0x00, 0xaa, 0x55, 0x00, 0x00, 0xaa, 0xaa, 0x55, 0x55, 0xff, 0xff, 0x55, 0x55, 0xff, 0xff }; module_param_array(default_grn, byte, NULL, S_IRUGO | S_IWUSR); EXPORT_SYMBOL(default_grn); unsigned char default_blu[] = { 0x00, 0x00, 0x00, 0x00, 0xaa, 0xaa, 0xaa, 0xaa, 0x55, 0x55, 0x55, 0x55, 0xff, 0xff, 0xff, 0xff }; module_param_array(default_blu, byte, NULL, S_IRUGO | S_IWUSR); EXPORT_SYMBOL(default_blu); /* * gotoxy() must verify all boundaries, because the arguments * might also be negative. If the given position is out of * bounds, the cursor is placed at the nearest margin. */ static void gotoxy(struct vc_data *vc, int new_x, int new_y) { int min_y, max_y; if (new_x < 0) vc->state.x = 0; else { if (new_x >= vc->vc_cols) vc->state.x = vc->vc_cols - 1; else vc->state.x = new_x; } if (vc->vc_decom) { min_y = vc->vc_top; max_y = vc->vc_bottom; } else { min_y = 0; max_y = vc->vc_rows; } if (new_y < min_y) vc->state.y = min_y; else if (new_y >= max_y) vc->state.y = max_y - 1; else vc->state.y = new_y; vc->vc_pos = vc->vc_origin + vc->state.y * vc->vc_size_row + (vc->state.x << 1); vc->vc_need_wrap = 0; } /* for absolute user moves, when decom is set */ static void gotoxay(struct vc_data *vc, int new_x, int new_y) { gotoxy(vc, new_x, vc->vc_decom ? (vc->vc_top + new_y) : new_y); } void scrollback(struct vc_data *vc) { scrolldelta(-(vc->vc_rows / 2)); } void scrollfront(struct vc_data *vc, int lines) { if (!lines) lines = vc->vc_rows / 2; scrolldelta(lines); } static void lf(struct vc_data *vc) { /* don't scroll if above bottom of scrolling region, or * if below scrolling region */ if (vc->state.y + 1 == vc->vc_bottom) con_scroll(vc, vc->vc_top, vc->vc_bottom, SM_UP, 1); else if (vc->state.y < vc->vc_rows - 1) { vc->state.y++; vc->vc_pos += vc->vc_size_row; } vc->vc_need_wrap = 0; notify_write(vc, '\n'); } static void ri(struct vc_data *vc) { /* don't scroll if below top of scrolling region, or * if above scrolling region */ if (vc->state.y == vc->vc_top) con_scroll(vc, vc->vc_top, vc->vc_bottom, SM_DOWN, 1); else if (vc->state.y > 0) { vc->state.y--; vc->vc_pos -= vc->vc_size_row; } vc->vc_need_wrap = 0; } static inline void cr(struct vc_data *vc) { vc->vc_pos -= vc->state.x << 1; vc->vc_need_wrap = vc->state.x = 0; notify_write(vc, '\r'); } static inline void bs(struct vc_data *vc) { if (vc->state.x) { vc->vc_pos -= 2; vc->state.x--; vc->vc_need_wrap = 0; notify_write(vc, '\b'); } } static inline void del(struct vc_data *vc) { /* ignored */ } enum CSI_J { CSI_J_CURSOR_TO_END = 0, CSI_J_START_TO_CURSOR = 1, CSI_J_VISIBLE = 2, CSI_J_FULL = 3, }; static void csi_J(struct vc_data *vc, enum CSI_J vpar) { unsigned short *start; unsigned int count; switch (vpar) { case CSI_J_CURSOR_TO_END: vc_uniscr_clear_line(vc, vc->state.x, vc->vc_cols - vc->state.x); vc_uniscr_clear_lines(vc, vc->state.y + 1, vc->vc_rows - vc->state.y - 1); count = (vc->vc_scr_end - vc->vc_pos) >> 1; start = (unsigned short *)vc->vc_pos; break; case CSI_J_START_TO_CURSOR: vc_uniscr_clear_line(vc, 0, vc->state.x + 1); vc_uniscr_clear_lines(vc, 0, vc->state.y); count = ((vc->vc_pos - vc->vc_origin) >> 1) + 1; start = (unsigned short *)vc->vc_origin; break; case CSI_J_FULL: flush_scrollback(vc); fallthrough; case CSI_J_VISIBLE: vc_uniscr_clear_lines(vc, 0, vc->vc_rows); count = vc->vc_cols * vc->vc_rows; start = (unsigned short *)vc->vc_origin; break; default: return; } scr_memsetw(start, vc->vc_video_erase_char, 2 * count); if (con_should_update(vc)) do_update_region(vc, (unsigned long) start, count); vc->vc_need_wrap = 0; } enum { CSI_K_CURSOR_TO_LINEEND = 0, CSI_K_LINESTART_TO_CURSOR = 1, CSI_K_LINE = 2, }; static void csi_K(struct vc_data *vc) { unsigned int count; unsigned short *start = (unsigned short *)vc->vc_pos; int offset; switch (vc->vc_par[0]) { case CSI_K_CURSOR_TO_LINEEND: offset = 0; count = vc->vc_cols - vc->state.x; break; case CSI_K_LINESTART_TO_CURSOR: offset = -vc->state.x; count = vc->state.x + 1; break; case CSI_K_LINE: offset = -vc->state.x; count = vc->vc_cols; break; default: return; } vc_uniscr_clear_line(vc, vc->state.x + offset, count); scr_memsetw(start + offset, vc->vc_video_erase_char, 2 * count); vc->vc_need_wrap = 0; if (con_should_update(vc)) do_update_region(vc, (unsigned long)(start + offset), count); } /* erase the following count positions */ static void csi_X(struct vc_data *vc) { /* not vt100? */ unsigned int count = clamp(vc->vc_par[0], 1, vc->vc_cols - vc->state.x); vc_uniscr_clear_line(vc, vc->state.x, count); scr_memsetw((unsigned short *)vc->vc_pos, vc->vc_video_erase_char, 2 * count); if (con_should_update(vc)) vc->vc_sw->con_clear(vc, vc->state.y, vc->state.x, count); vc->vc_need_wrap = 0; } static void default_attr(struct vc_data *vc) { vc->state.intensity = VCI_NORMAL; vc->state.italic = false; vc->state.underline = false; vc->state.reverse = false; vc->state.blink = false; vc->state.color = vc->vc_def_color; } struct rgb { u8 r; u8 g; u8 b; }; static void rgb_from_256(unsigned int i, struct rgb *c) { if (i < 8) { /* Standard colours. */ c->r = i&1 ? 0xaa : 0x00; c->g = i&2 ? 0xaa : 0x00; c->b = i&4 ? 0xaa : 0x00; } else if (i < 16) { c->r = i&1 ? 0xff : 0x55; c->g = i&2 ? 0xff : 0x55; c->b = i&4 ? 0xff : 0x55; } else if (i < 232) { /* 6x6x6 colour cube. */ i -= 16; c->b = i % 6 * 255 / 6; i /= 6; c->g = i % 6 * 255 / 6; i /= 6; c->r = i * 255 / 6; } else /* Grayscale ramp. */ c->r = c->g = c->b = i * 10 - 2312; } static void rgb_foreground(struct vc_data *vc, const struct rgb *c) { u8 hue = 0, max = max3(c->r, c->g, c->b); if (c->r > max / 2) hue |= 4; if (c->g > max / 2) hue |= 2; if (c->b > max / 2) hue |= 1; if (hue == 7 && max <= 0x55) { hue = 0; vc->state.intensity = VCI_BOLD; } else if (max > 0xaa) vc->state.intensity = VCI_BOLD; else vc->state.intensity = VCI_NORMAL; vc->state.color = (vc->state.color & 0xf0) | hue; } static void rgb_background(struct vc_data *vc, const struct rgb *c) { /* For backgrounds, err on the dark side. */ vc->state.color = (vc->state.color & 0x0f) | (c->r&0x80) >> 1 | (c->g&0x80) >> 2 | (c->b&0x80) >> 3; } /* * ITU T.416 Higher colour modes. They break the usual properties of SGR codes * and thus need to be detected and ignored by hand. That standard also * wants : rather than ; as separators but sequences containing : are currently * completely ignored by the parser. * * Subcommands 3 (CMY) and 4 (CMYK) are so insane there's no point in * supporting them. */ static int vc_t416_color(struct vc_data *vc, int i, void(*set_color)(struct vc_data *vc, const struct rgb *c)) { struct rgb c; i++; if (i > vc->vc_npar) return i; if (vc->vc_par[i] == 5 && i + 1 <= vc->vc_npar) { /* 256 colours */ i++; rgb_from_256(vc->vc_par[i], &c); } else if (vc->vc_par[i] == 2 && i + 3 <= vc->vc_npar) { /* 24 bit */ c.r = vc->vc_par[i + 1]; c.g = vc->vc_par[i + 2]; c.b = vc->vc_par[i + 3]; i += 3; } else return i; set_color(vc, &c); return i; } enum { CSI_m_DEFAULT = 0, CSI_m_BOLD = 1, CSI_m_HALF_BRIGHT = 2, CSI_m_ITALIC = 3, CSI_m_UNDERLINE = 4, CSI_m_BLINK = 5, CSI_m_REVERSE = 7, CSI_m_PRI_FONT = 10, CSI_m_ALT_FONT1 = 11, CSI_m_ALT_FONT2 = 12, CSI_m_DOUBLE_UNDERLINE = 21, CSI_m_NORMAL_INTENSITY = 22, CSI_m_NO_ITALIC = 23, CSI_m_NO_UNDERLINE = 24, CSI_m_NO_BLINK = 25, CSI_m_NO_REVERSE = 27, CSI_m_FG_COLOR_BEG = 30, CSI_m_FG_COLOR_END = 37, CSI_m_FG_COLOR = 38, CSI_m_DEFAULT_FG_COLOR = 39, CSI_m_BG_COLOR_BEG = 40, CSI_m_BG_COLOR_END = 47, CSI_m_BG_COLOR = 48, CSI_m_DEFAULT_BG_COLOR = 49, CSI_m_BRIGHT_FG_COLOR_BEG = 90, CSI_m_BRIGHT_FG_COLOR_END = 97, CSI_m_BRIGHT_FG_COLOR_OFF = CSI_m_BRIGHT_FG_COLOR_BEG - CSI_m_FG_COLOR_BEG, CSI_m_BRIGHT_BG_COLOR_BEG = 100, CSI_m_BRIGHT_BG_COLOR_END = 107, CSI_m_BRIGHT_BG_COLOR_OFF = CSI_m_BRIGHT_BG_COLOR_BEG - CSI_m_BG_COLOR_BEG, }; /* console_lock is held */ static void csi_m(struct vc_data *vc) { int i; for (i = 0; i <= vc->vc_npar; i++) switch (vc->vc_par[i]) { case CSI_m_DEFAULT: /* all attributes off */ default_attr(vc); break; case CSI_m_BOLD: vc->state.intensity = VCI_BOLD; break; case CSI_m_HALF_BRIGHT: vc->state.intensity = VCI_HALF_BRIGHT; break; case CSI_m_ITALIC: vc->state.italic = true; break; case CSI_m_DOUBLE_UNDERLINE: /* * No console drivers support double underline, so * convert it to a single underline. */ case CSI_m_UNDERLINE: vc->state.underline = true; break; case CSI_m_BLINK: vc->state.blink = true; break; case CSI_m_REVERSE: vc->state.reverse = true; break; case CSI_m_PRI_FONT: /* ANSI X3.64-1979 (SCO-ish?) * Select primary font, don't display control chars if * defined, don't set bit 8 on output. */ vc->vc_translate = set_translate(vc->state.Gx_charset[vc->state.charset], vc); vc->vc_disp_ctrl = 0; vc->vc_toggle_meta = 0; break; case CSI_m_ALT_FONT1: /* ANSI X3.64-1979 (SCO-ish?) * Select first alternate font, lets chars < 32 be * displayed as ROM chars. */ vc->vc_translate = set_translate(IBMPC_MAP, vc); vc->vc_disp_ctrl = 1; vc->vc_toggle_meta = 0; break; case CSI_m_ALT_FONT2: /* ANSI X3.64-1979 (SCO-ish?) * Select second alternate font, toggle high bit * before displaying as ROM char. */ vc->vc_translate = set_translate(IBMPC_MAP, vc); vc->vc_disp_ctrl = 1; vc->vc_toggle_meta = 1; break; case CSI_m_NORMAL_INTENSITY: vc->state.intensity = VCI_NORMAL; break; case CSI_m_NO_ITALIC: vc->state.italic = false; break; case CSI_m_NO_UNDERLINE: vc->state.underline = false; break; case CSI_m_NO_BLINK: vc->state.blink = false; break; case CSI_m_NO_REVERSE: vc->state.reverse = false; break; case CSI_m_FG_COLOR: i = vc_t416_color(vc, i, rgb_foreground); break; case CSI_m_BG_COLOR: i = vc_t416_color(vc, i, rgb_background); break; case CSI_m_DEFAULT_FG_COLOR: vc->state.color = (vc->vc_def_color & 0x0f) | (vc->state.color & 0xf0); break; case CSI_m_DEFAULT_BG_COLOR: vc->state.color = (vc->vc_def_color & 0xf0) | (vc->state.color & 0x0f); break; case CSI_m_BRIGHT_FG_COLOR_BEG ... CSI_m_BRIGHT_FG_COLOR_END: vc->state.intensity = VCI_BOLD; vc->vc_par[i] -= CSI_m_BRIGHT_FG_COLOR_OFF; fallthrough; case CSI_m_FG_COLOR_BEG ... CSI_m_FG_COLOR_END: vc->vc_par[i] -= CSI_m_FG_COLOR_BEG; vc->state.color = color_table[vc->vc_par[i]] | (vc->state.color & 0xf0); break; case CSI_m_BRIGHT_BG_COLOR_BEG ... CSI_m_BRIGHT_BG_COLOR_END: vc->vc_par[i] -= CSI_m_BRIGHT_BG_COLOR_OFF; fallthrough; case CSI_m_BG_COLOR_BEG ... CSI_m_BG_COLOR_END: vc->vc_par[i] -= CSI_m_BG_COLOR_BEG; vc->state.color = (color_table[vc->vc_par[i]] << 4) | (vc->state.color & 0x0f); break; } update_attr(vc); } static void respond_string(const char *p, size_t len, struct tty_port *port) { tty_insert_flip_string(port, p, len); tty_flip_buffer_push(port); } static void cursor_report(struct vc_data *vc, struct tty_struct *tty) { char buf[40]; int len; len = sprintf(buf, "\033[%d;%dR", vc->state.y + (vc->vc_decom ? vc->vc_top + 1 : 1), vc->state.x + 1); respond_string(buf, len, tty->port); } static inline void status_report(struct tty_struct *tty) { static const char teminal_ok[] = "\033[0n"; respond_string(teminal_ok, strlen(teminal_ok), tty->port); } static inline void respond_ID(struct tty_struct *tty) { /* terminal answer to an ESC-Z or csi0c query. */ static const char vt102_id[] = "\033[?6c"; respond_string(vt102_id, strlen(vt102_id), tty->port); } void mouse_report(struct tty_struct *tty, int butt, int mrx, int mry) { char buf[8]; int len; len = sprintf(buf, "\033[M%c%c%c", (char)(' ' + butt), (char)('!' + mrx), (char)('!' + mry)); respond_string(buf, len, tty->port); } /* invoked via ioctl(TIOCLINUX) and through set_selection_user */ int mouse_reporting(void) { return vc_cons[fg_console].d->vc_report_mouse; } /* invoked via ioctl(TIOCLINUX) */ static int get_bracketed_paste(struct tty_struct *tty) { struct vc_data *vc = tty->driver_data; return vc->vc_bracketed_paste; } enum { CSI_DEC_hl_CURSOR_KEYS = 1, /* CKM: cursor keys send ^[Ox/^[[x */ CSI_DEC_hl_132_COLUMNS = 3, /* COLM: 80/132 mode switch */ CSI_DEC_hl_REVERSE_VIDEO = 5, /* SCNM */ CSI_DEC_hl_ORIGIN_MODE = 6, /* OM: origin relative/absolute */ CSI_DEC_hl_AUTOWRAP = 7, /* AWM */ CSI_DEC_hl_AUTOREPEAT = 8, /* ARM */ CSI_DEC_hl_MOUSE_X10 = 9, CSI_DEC_hl_SHOW_CURSOR = 25, /* TCEM */ CSI_DEC_hl_MOUSE_VT200 = 1000, CSI_DEC_hl_BRACKETED_PASTE = 2004, }; /* console_lock is held */ static void csi_DEC_hl(struct vc_data *vc, bool on_off) { unsigned int i; for (i = 0; i <= vc->vc_npar; i++) switch (vc->vc_par[i]) { case CSI_DEC_hl_CURSOR_KEYS: if (on_off) set_kbd(vc, decckm); else clr_kbd(vc, decckm); break; case CSI_DEC_hl_132_COLUMNS: /* unimplemented */ #if 0 vc_resize(deccolm ? 132 : 80, vc->vc_rows); /* this alone does not suffice; some user mode utility has to change the hardware regs */ #endif break; case CSI_DEC_hl_REVERSE_VIDEO: if (vc->vc_decscnm != on_off) { vc->vc_decscnm = on_off; invert_screen(vc, 0, vc->vc_screenbuf_size, false); update_attr(vc); } break; case CSI_DEC_hl_ORIGIN_MODE: vc->vc_decom = on_off; gotoxay(vc, 0, 0); break; case CSI_DEC_hl_AUTOWRAP: vc->vc_decawm = on_off; break; case CSI_DEC_hl_AUTOREPEAT: if (on_off) set_kbd(vc, decarm); else clr_kbd(vc, decarm); break; case CSI_DEC_hl_MOUSE_X10: vc->vc_report_mouse = on_off ? 1 : 0; break; case CSI_DEC_hl_SHOW_CURSOR: vc->vc_deccm = on_off; break; case CSI_DEC_hl_MOUSE_VT200: vc->vc_report_mouse = on_off ? 2 : 0; break; case CSI_DEC_hl_BRACKETED_PASTE: vc->vc_bracketed_paste = on_off; break; } } enum { CSI_hl_DISPLAY_CTRL = 3, /* handle ansi control chars */ CSI_hl_INSERT = 4, /* IRM: insert/replace */ CSI_hl_AUTO_NL = 20, /* LNM: Enter == CrLf/Lf */ }; /* console_lock is held */ static void csi_hl(struct vc_data *vc, bool on_off) { unsigned int i; for (i = 0; i <= vc->vc_npar; i++) switch (vc->vc_par[i]) { /* ANSI modes set/reset */ case CSI_hl_DISPLAY_CTRL: vc->vc_disp_ctrl = on_off; break; case CSI_hl_INSERT: vc->vc_decim = on_off; break; case CSI_hl_AUTO_NL: if (on_off) set_kbd(vc, lnm); else clr_kbd(vc, lnm); break; } } enum CSI_right_square_bracket { CSI_RSB_COLOR_FOR_UNDERLINE = 1, CSI_RSB_COLOR_FOR_HALF_BRIGHT = 2, CSI_RSB_MAKE_CUR_COLOR_DEFAULT = 8, CSI_RSB_BLANKING_INTERVAL = 9, CSI_RSB_BELL_FREQUENCY = 10, CSI_RSB_BELL_DURATION = 11, CSI_RSB_BRING_CONSOLE_TO_FRONT = 12, CSI_RSB_UNBLANK = 13, CSI_RSB_VESA_OFF_INTERVAL = 14, CSI_RSB_BRING_PREV_CONSOLE_TO_FRONT = 15, CSI_RSB_CURSOR_BLINK_INTERVAL = 16, }; /* * csi_RSB - csi+] (Right Square Bracket) handler * * These are linux console private sequences. * * console_lock is held */ static void csi_RSB(struct vc_data *vc) { switch (vc->vc_par[0]) { case CSI_RSB_COLOR_FOR_UNDERLINE: if (vc->vc_can_do_color && vc->vc_par[1] < 16) { vc->vc_ulcolor = color_table[vc->vc_par[1]]; if (vc->state.underline) update_attr(vc); } break; case CSI_RSB_COLOR_FOR_HALF_BRIGHT: if (vc->vc_can_do_color && vc->vc_par[1] < 16) { vc->vc_halfcolor = color_table[vc->vc_par[1]]; if (vc->state.intensity == VCI_HALF_BRIGHT) update_attr(vc); } break; case CSI_RSB_MAKE_CUR_COLOR_DEFAULT: vc->vc_def_color = vc->vc_attr; if (vc->vc_hi_font_mask == 0x100) vc->vc_def_color >>= 1; default_attr(vc); update_attr(vc); break; case CSI_RSB_BLANKING_INTERVAL: blankinterval = min(vc->vc_par[1], 60U) * 60; poke_blanked_console(); break; case CSI_RSB_BELL_FREQUENCY: if (vc->vc_npar >= 1) vc->vc_bell_pitch = vc->vc_par[1]; else vc->vc_bell_pitch = DEFAULT_BELL_PITCH; break; case CSI_RSB_BELL_DURATION: if (vc->vc_npar >= 1) vc->vc_bell_duration = (vc->vc_par[1] < 2000) ? msecs_to_jiffies(vc->vc_par[1]) : 0; else vc->vc_bell_duration = DEFAULT_BELL_DURATION; break; case CSI_RSB_BRING_CONSOLE_TO_FRONT: if (vc->vc_par[1] >= 1 && vc_cons_allocated(vc->vc_par[1] - 1)) set_console(vc->vc_par[1] - 1); break; case CSI_RSB_UNBLANK: poke_blanked_console(); break; case CSI_RSB_VESA_OFF_INTERVAL: vesa_off_interval = min(vc->vc_par[1], 60U) * 60 * HZ; break; case CSI_RSB_BRING_PREV_CONSOLE_TO_FRONT: set_console(last_console); break; case CSI_RSB_CURSOR_BLINK_INTERVAL: if (vc->vc_npar >= 1 && vc->vc_par[1] >= 50 && vc->vc_par[1] <= USHRT_MAX) vc->vc_cur_blink_ms = vc->vc_par[1]; else vc->vc_cur_blink_ms = DEFAULT_CURSOR_BLINK_MS; break; } } /* console_lock is held */ static void csi_at(struct vc_data *vc, unsigned int nr) { nr = clamp(nr, 1, vc->vc_cols - vc->state.x); insert_char(vc, nr); } /* console_lock is held */ static void csi_L(struct vc_data *vc) { unsigned int nr = clamp(vc->vc_par[0], 1, vc->vc_rows - vc->state.y); con_scroll(vc, vc->state.y, vc->vc_bottom, SM_DOWN, nr); vc->vc_need_wrap = 0; } /* console_lock is held */ static void csi_P(struct vc_data *vc) { unsigned int nr = clamp(vc->vc_par[0], 1, vc->vc_cols - vc->state.x); delete_char(vc, nr); } /* console_lock is held */ static void csi_M(struct vc_data *vc) { unsigned int nr = clamp(vc->vc_par[0], 1, vc->vc_rows - vc->state.y); con_scroll(vc, vc->state.y, vc->vc_bottom, SM_UP, nr); vc->vc_need_wrap = 0; } /* console_lock is held (except via vc_init->reset_terminal */ static void save_cur(struct vc_data *vc) { memcpy(&vc->saved_state, &vc->state, sizeof(vc->state)); } /* console_lock is held */ static void restore_cur(struct vc_data *vc) { memcpy(&vc->state, &vc->saved_state, sizeof(vc->state)); gotoxy(vc, vc->state.x, vc->state.y); vc->vc_translate = set_translate(vc->state.Gx_charset[vc->state.charset], vc); update_attr(vc); vc->vc_need_wrap = 0; } /** * enum vc_ctl_state - control characters state of a vt * * @ESnormal: initial state, no control characters parsed * @ESesc: ESC parsed * @ESsquare: CSI parsed -- modifiers/parameters/ctrl chars expected * @ESgetpars: CSI parsed -- parameters/ctrl chars expected * @ESfunckey: CSI [ parsed * @EShash: ESC # parsed * @ESsetG0: ESC ( parsed * @ESsetG1: ESC ) parsed * @ESpercent: ESC % parsed * @EScsiignore: CSI [0x20-0x3f] parsed * @ESnonstd: OSC parsed * @ESpalette: OSC P parsed * @ESosc: OSC [0-9] parsed * @ESANSI_first: first state for ignoring ansi control sequences * @ESapc: ESC _ parsed * @ESpm: ESC ^ parsed * @ESdcs: ESC P parsed * @ESANSI_last: last state for ignoring ansi control sequences */ enum vc_ctl_state { ESnormal, ESesc, ESsquare, ESgetpars, ESfunckey, EShash, ESsetG0, ESsetG1, ESpercent, EScsiignore, ESnonstd, ESpalette, ESosc, ESANSI_first = ESosc, ESapc, ESpm, ESdcs, ESANSI_last = ESdcs, }; /* console_lock is held (except via vc_init()) */ static void reset_terminal(struct vc_data *vc, int do_clear) { unsigned int i; vc->vc_top = 0; vc->vc_bottom = vc->vc_rows; vc->vc_state = ESnormal; vc->vc_priv = EPecma; vc->vc_translate = set_translate(LAT1_MAP, vc); vc->state.Gx_charset[0] = LAT1_MAP; vc->state.Gx_charset[1] = GRAF_MAP; vc->state.charset = 0; vc->vc_need_wrap = 0; vc->vc_report_mouse = 0; vc->vc_bracketed_paste = 0; vc->vc_utf = default_utf8; vc->vc_utf_count = 0; vc->vc_disp_ctrl = 0; vc->vc_toggle_meta = 0; vc->vc_decscnm = 0; vc->vc_decom = 0; vc->vc_decawm = 1; vc->vc_deccm = global_cursor_default; vc->vc_decim = 0; vt_reset_keyboard(vc->vc_num); vc->vc_cursor_type = cur_default; vc->vc_complement_mask = vc->vc_s_complement_mask; default_attr(vc); update_attr(vc); bitmap_zero(vc->vc_tab_stop, VC_TABSTOPS_COUNT); for (i = 0; i < VC_TABSTOPS_COUNT; i += 8) set_bit(i, vc->vc_tab_stop); vc->vc_bell_pitch = DEFAULT_BELL_PITCH; vc->vc_bell_duration = DEFAULT_BELL_DURATION; vc->vc_cur_blink_ms = DEFAULT_CURSOR_BLINK_MS; gotoxy(vc, 0, 0); save_cur(vc); if (do_clear) csi_J(vc, CSI_J_VISIBLE); } static void vc_setGx(struct vc_data *vc, unsigned int which, u8 c) { unsigned char *charset = &vc->state.Gx_charset[which]; switch (c) { case '0': *charset = GRAF_MAP; break; case 'B': *charset = LAT1_MAP; break; case 'U': *charset = IBMPC_MAP; break; case 'K': *charset = USER_MAP; break; } if (vc->state.charset == which) vc->vc_translate = set_translate(*charset, vc); } static bool ansi_control_string(enum vc_ctl_state state) { return state >= ESANSI_first && state <= ESANSI_last; } enum { ASCII_NULL = 0, ASCII_BELL = 7, ASCII_BACKSPACE = 8, ASCII_IGNORE_FIRST = ASCII_BACKSPACE, ASCII_HTAB = 9, ASCII_LINEFEED = 10, ASCII_VTAB = 11, ASCII_FORMFEED = 12, ASCII_CAR_RET = 13, ASCII_IGNORE_LAST = ASCII_CAR_RET, ASCII_SHIFTOUT = 14, ASCII_SHIFTIN = 15, ASCII_CANCEL = 24, ASCII_SUBSTITUTE = 26, ASCII_ESCAPE = 27, ASCII_CSI_IGNORE_FIRST = ' ', /* 0x2x, 0x3a and 0x3c - 0x3f */ ASCII_CSI_IGNORE_LAST = '?', ASCII_DEL = 127, ASCII_EXT_CSI = 128 + ASCII_ESCAPE, }; /* * Handle ascii characters in control sequences and change states accordingly. * E.g. ESC sets the state of vc to ESesc. * * Returns: true if @c handled. */ static bool handle_ascii(struct tty_struct *tty, struct vc_data *vc, u8 c) { switch (c) { case ASCII_NULL: return true; case ASCII_BELL: if (ansi_control_string(vc->vc_state)) vc->vc_state = ESnormal; else if (vc->vc_bell_duration) kd_mksound(vc->vc_bell_pitch, vc->vc_bell_duration); return true; case ASCII_BACKSPACE: bs(vc); return true; case ASCII_HTAB: vc->vc_pos -= (vc->state.x << 1); vc->state.x = find_next_bit(vc->vc_tab_stop, min(vc->vc_cols - 1, VC_TABSTOPS_COUNT), vc->state.x + 1); if (vc->state.x >= VC_TABSTOPS_COUNT) vc->state.x = vc->vc_cols - 1; vc->vc_pos += (vc->state.x << 1); notify_write(vc, '\t'); return true; case ASCII_LINEFEED: case ASCII_VTAB: case ASCII_FORMFEED: lf(vc); if (!is_kbd(vc, lnm)) return true; fallthrough; case ASCII_CAR_RET: cr(vc); return true; case ASCII_SHIFTOUT: vc->state.charset = 1; vc->vc_translate = set_translate(vc->state.Gx_charset[1], vc); vc->vc_disp_ctrl = 1; return true; case ASCII_SHIFTIN: vc->state.charset = 0; vc->vc_translate = set_translate(vc->state.Gx_charset[0], vc); vc->vc_disp_ctrl = 0; return true; case ASCII_CANCEL: case ASCII_SUBSTITUTE: vc->vc_state = ESnormal; return true; case ASCII_ESCAPE: vc->vc_state = ESesc; return true; case ASCII_DEL: del(vc); return true; case ASCII_EXT_CSI: vc->vc_state = ESsquare; return true; } return false; } /* * Handle a character (@c) following an ESC (when @vc is in the ESesc state). * E.g. previous ESC with @c == '[' here yields the ESsquare state (that is: * CSI). */ static void handle_esc(struct tty_struct *tty, struct vc_data *vc, u8 c) { vc->vc_state = ESnormal; switch (c) { case '[': vc->vc_state = ESsquare; break; case ']': vc->vc_state = ESnonstd; break; case '_': vc->vc_state = ESapc; break; case '^': vc->vc_state = ESpm; break; case '%': vc->vc_state = ESpercent; break; case 'E': cr(vc); lf(vc); break; case 'M': ri(vc); break; case 'D': lf(vc); break; case 'H': if (vc->state.x < VC_TABSTOPS_COUNT) set_bit(vc->state.x, vc->vc_tab_stop); break; case 'P': vc->vc_state = ESdcs; break; case 'Z': respond_ID(tty); break; case '7': save_cur(vc); break; case '8': restore_cur(vc); break; case '(': vc->vc_state = ESsetG0; break; case ')': vc->vc_state = ESsetG1; break; case '#': vc->vc_state = EShash; break; case 'c': reset_terminal(vc, 1); break; case '>': /* Numeric keypad */ clr_kbd(vc, kbdapplic); break; case '=': /* Appl. keypad */ set_kbd(vc, kbdapplic); break; } } /* * Handle special DEC control sequences ("ESC [ ? parameters char"). Parameters * are in @vc->vc_par and the char is in @c here. */ static void csi_DEC(struct tty_struct *tty, struct vc_data *vc, u8 c) { switch (c) { case 'h': csi_DEC_hl(vc, true); break; case 'l': csi_DEC_hl(vc, false); break; case 'c': if (vc->vc_par[0]) vc->vc_cursor_type = CUR_MAKE(vc->vc_par[0], vc->vc_par[1], vc->vc_par[2]); else vc->vc_cursor_type = cur_default; break; case 'm': clear_selection(); if (vc->vc_par[0]) vc->vc_complement_mask = vc->vc_par[0] << 8 | vc->vc_par[1]; else vc->vc_complement_mask = vc->vc_s_complement_mask; break; case 'n': if (vc->vc_par[0] == 5) status_report(tty); else if (vc->vc_par[0] == 6) cursor_report(vc, tty); break; } } /* * Handle Control Sequence Introducer control characters. That is * "ESC [ parameters char". Parameters are in @vc->vc_par and the char is in * @c here. */ static void csi_ECMA(struct tty_struct *tty, struct vc_data *vc, u8 c) { switch (c) { case 'G': case '`': if (vc->vc_par[0]) vc->vc_par[0]--; gotoxy(vc, vc->vc_par[0], vc->state.y); break; case 'A': if (!vc->vc_par[0]) vc->vc_par[0]++; gotoxy(vc, vc->state.x, vc->state.y - vc->vc_par[0]); break; case 'B': case 'e': if (!vc->vc_par[0]) vc->vc_par[0]++; gotoxy(vc, vc->state.x, vc->state.y + vc->vc_par[0]); break; case 'C': case 'a': if (!vc->vc_par[0]) vc->vc_par[0]++; gotoxy(vc, vc->state.x + vc->vc_par[0], vc->state.y); break; case 'D': if (!vc->vc_par[0]) vc->vc_par[0]++; gotoxy(vc, vc->state.x - vc->vc_par[0], vc->state.y); break; case 'E': if (!vc->vc_par[0]) vc->vc_par[0]++; gotoxy(vc, 0, vc->state.y + vc->vc_par[0]); break; case 'F': if (!vc->vc_par[0]) vc->vc_par[0]++; gotoxy(vc, 0, vc->state.y - vc->vc_par[0]); break; case 'd': if (vc->vc_par[0]) vc->vc_par[0]--; gotoxay(vc, vc->state.x ,vc->vc_par[0]); break; case 'H': case 'f': if (vc->vc_par[0]) vc->vc_par[0]--; if (vc->vc_par[1]) vc->vc_par[1]--; gotoxay(vc, vc->vc_par[1], vc->vc_par[0]); break; case 'J': csi_J(vc, vc->vc_par[0]); break; case 'K': csi_K(vc); break; case 'L': csi_L(vc); break; case 'M': csi_M(vc); break; case 'P': csi_P(vc); break; case 'c': if (!vc->vc_par[0]) respond_ID(tty); break; case 'g': if (!vc->vc_par[0] && vc->state.x < VC_TABSTOPS_COUNT) set_bit(vc->state.x, vc->vc_tab_stop); else if (vc->vc_par[0] == 3) bitmap_zero(vc->vc_tab_stop, VC_TABSTOPS_COUNT); break; case 'h': csi_hl(vc, true); break; case 'l': csi_hl(vc, false); break; case 'm': csi_m(vc); break; case 'n': if (vc->vc_par[0] == 5) status_report(tty); else if (vc->vc_par[0] == 6) cursor_report(vc, tty); break; case 'q': /* DECLL - but only 3 leds */ /* map 0,1,2,3 to 0,1,2,4 */ if (vc->vc_par[0] < 4) vt_set_led_state(vc->vc_num, (vc->vc_par[0] < 3) ? vc->vc_par[0] : 4); break; case 'r': if (!vc->vc_par[0]) vc->vc_par[0]++; if (!vc->vc_par[1]) vc->vc_par[1] = vc->vc_rows; /* Minimum allowed region is 2 lines */ if (vc->vc_par[0] < vc->vc_par[1] && vc->vc_par[1] <= vc->vc_rows) { vc->vc_top = vc->vc_par[0] - 1; vc->vc_bottom = vc->vc_par[1]; gotoxay(vc, 0, 0); } break; case 's': save_cur(vc); break; case 'u': restore_cur(vc); break; case 'X': csi_X(vc); break; case '@': csi_at(vc, vc->vc_par[0]); break; case ']': csi_RSB(vc); break; } } static void vc_reset_params(struct vc_data *vc) { memset(vc->vc_par, 0, sizeof(vc->vc_par)); vc->vc_npar = 0; } /* console_lock is held */ static void do_con_trol(struct tty_struct *tty, struct vc_data *vc, u8 c) { /* * Control characters can be used in the _middle_ * of an escape sequence, aside from ANSI control strings. */ if (ansi_control_string(vc->vc_state) && c >= ASCII_IGNORE_FIRST && c <= ASCII_IGNORE_LAST) return; if (handle_ascii(tty, vc, c)) return; switch(vc->vc_state) { case ESesc: /* ESC */ handle_esc(tty, vc, c); return; case ESnonstd: /* ESC ] aka OSC */ switch (c) { case 'P': /* palette escape sequence */ vc_reset_params(vc); vc->vc_state = ESpalette; return; case 'R': /* reset palette */ reset_palette(vc); break; case '0' ... '9': vc->vc_state = ESosc; return; } vc->vc_state = ESnormal; return; case ESpalette: /* ESC ] P aka OSC P */ if (isxdigit(c)) { vc->vc_par[vc->vc_npar++] = hex_to_bin(c); if (vc->vc_npar == 7) { int i = vc->vc_par[0] * 3, j = 1; vc->vc_palette[i] = 16 * vc->vc_par[j++]; vc->vc_palette[i++] += vc->vc_par[j++]; vc->vc_palette[i] = 16 * vc->vc_par[j++]; vc->vc_palette[i++] += vc->vc_par[j++]; vc->vc_palette[i] = 16 * vc->vc_par[j++]; vc->vc_palette[i] += vc->vc_par[j]; set_palette(vc); vc->vc_state = ESnormal; } } else vc->vc_state = ESnormal; return; case ESsquare: /* ESC [ aka CSI, parameters or modifiers expected */ vc_reset_params(vc); vc->vc_state = ESgetpars; switch (c) { case '[': /* Function key */ vc->vc_state = ESfunckey; return; case '?': vc->vc_priv = EPdec; return; case '>': vc->vc_priv = EPgt; return; case '=': vc->vc_priv = EPeq; return; case '<': vc->vc_priv = EPlt; return; } vc->vc_priv = EPecma; fallthrough; case ESgetpars: /* ESC [ aka CSI, parameters expected */ switch (c) { case ';': if (vc->vc_npar < NPAR - 1) { vc->vc_npar++; return; } break; case '0' ... '9': vc->vc_par[vc->vc_npar] *= 10; vc->vc_par[vc->vc_npar] += c - '0'; return; } if (c >= ASCII_CSI_IGNORE_FIRST && c <= ASCII_CSI_IGNORE_LAST) { vc->vc_state = EScsiignore; return; } /* parameters done, handle the control char @c */ vc->vc_state = ESnormal; switch (vc->vc_priv) { case EPdec: csi_DEC(tty, vc, c); return; case EPecma: csi_ECMA(tty, vc, c); return; default: return; } case EScsiignore: if (c >= ASCII_CSI_IGNORE_FIRST && c <= ASCII_CSI_IGNORE_LAST) return; vc->vc_state = ESnormal; return; case ESpercent: /* ESC % */ vc->vc_state = ESnormal; switch (c) { case '@': /* defined in ISO 2022 */ vc->vc_utf = 0; return; case 'G': /* prelim official escape code */ case '8': /* retained for compatibility */ vc->vc_utf = 1; return; } return; case ESfunckey: /* ESC [ [ aka CSI [ */ vc->vc_state = ESnormal; return; case EShash: /* ESC # */ vc->vc_state = ESnormal; if (c == '8') { /* DEC screen alignment test. kludge :-) */ vc->vc_video_erase_char = (vc->vc_video_erase_char & 0xff00) | 'E'; csi_J(vc, CSI_J_VISIBLE); vc->vc_video_erase_char = (vc->vc_video_erase_char & 0xff00) | ' '; do_update_region(vc, vc->vc_origin, vc->vc_screenbuf_size / 2); } return; case ESsetG0: /* ESC ( */ vc_setGx(vc, 0, c); vc->vc_state = ESnormal; return; case ESsetG1: /* ESC ) */ vc_setGx(vc, 1, c); vc->vc_state = ESnormal; return; case ESapc: /* ESC _ */ return; case ESosc: /* ESC ] [0-9] aka OSC [0-9] */ return; case ESpm: /* ESC ^ */ return; case ESdcs: /* ESC P */ return; default: vc->vc_state = ESnormal; } } struct vc_draw_region { unsigned long from, to; int x; }; static void con_flush(struct vc_data *vc, struct vc_draw_region *draw) { if (draw->x < 0) return; vc->vc_sw->con_putcs(vc, (u16 *)draw->from, (u16 *)draw->to - (u16 *)draw->from, vc->state.y, draw->x); draw->x = -1; } static inline int vc_translate_ascii(const struct vc_data *vc, int c) { if (IS_ENABLED(CONFIG_CONSOLE_TRANSLATIONS)) { if (vc->vc_toggle_meta) c |= 0x80; return vc->vc_translate[c]; } return c; } /** * vc_sanitize_unicode - Replace invalid Unicode code points with ``U+FFFD`` * @c: the received code point */ static inline int vc_sanitize_unicode(const int c) { if (c >= 0xd800 && c <= 0xdfff) return 0xfffd; return c; } /** * vc_translate_unicode - Combine UTF-8 into Unicode in &vc_data.vc_utf_char * @vc: virtual console * @c: UTF-8 byte to translate * @rescan: set to true iff @c wasn't consumed here and needs to be re-processed * * * &vc_data.vc_utf_char is the being-constructed Unicode code point. * * &vc_data.vc_utf_count is the number of continuation bytes still expected to * arrive. * * &vc_data.vc_npar is the number of continuation bytes arrived so far. * * Return: * * %-1 - Input OK so far, @c consumed, further bytes expected. * * %0xFFFD - Possibility 1: input invalid, @c may have been consumed (see * desc. of @rescan). Possibility 2: input OK, @c consumed, * ``U+FFFD`` is the resulting code point. ``U+FFFD`` is valid, * ``REPLACEMENT CHARACTER``. * * otherwise - Input OK, @c consumed, resulting code point returned. */ static int vc_translate_unicode(struct vc_data *vc, int c, bool *rescan) { static const u32 utf8_length_changes[] = {0x7f, 0x7ff, 0xffff, 0x10ffff}; /* Continuation byte received */ if ((c & 0xc0) == 0x80) { /* Unexpected continuation byte? */ if (!vc->vc_utf_count) goto bad_sequence; vc->vc_utf_char = (vc->vc_utf_char << 6) | (c & 0x3f); vc->vc_npar++; if (--vc->vc_utf_count) goto need_more_bytes; /* Got a whole character */ c = vc->vc_utf_char; /* Reject overlong sequences */ if (c <= utf8_length_changes[vc->vc_npar - 1] || c > utf8_length_changes[vc->vc_npar]) goto bad_sequence; return vc_sanitize_unicode(c); } /* Single ASCII byte or first byte of a sequence received */ if (vc->vc_utf_count) { /* A continuation byte was expected */ *rescan = true; vc->vc_utf_count = 0; goto bad_sequence; } /* Nothing to do if an ASCII byte was received */ if (c <= 0x7f) return c; /* First byte of a multibyte sequence received */ vc->vc_npar = 0; if ((c & 0xe0) == 0xc0) { vc->vc_utf_count = 1; vc->vc_utf_char = (c & 0x1f); } else if ((c & 0xf0) == 0xe0) { vc->vc_utf_count = 2; vc->vc_utf_char = (c & 0x0f); } else if ((c & 0xf8) == 0xf0) { vc->vc_utf_count = 3; vc->vc_utf_char = (c & 0x07); } else { goto bad_sequence; } need_more_bytes: return -1; bad_sequence: return 0xfffd; } static int vc_translate(struct vc_data *vc, int *c, bool *rescan) { /* Do no translation at all in control states */ if (vc->vc_state != ESnormal) return *c; if (vc->vc_utf && !vc->vc_disp_ctrl) return *c = vc_translate_unicode(vc, *c, rescan); /* no utf or alternate charset mode */ return vc_translate_ascii(vc, *c); } static inline unsigned char vc_invert_attr(const struct vc_data *vc) { if (!vc->vc_can_do_color) return vc->vc_attr ^ 0x08; if (vc->vc_hi_font_mask == 0x100) return (vc->vc_attr & 0x11) | ((vc->vc_attr & 0xe0) >> 4) | ((vc->vc_attr & 0x0e) << 4); return (vc->vc_attr & 0x88) | ((vc->vc_attr & 0x70) >> 4) | ((vc->vc_attr & 0x07) << 4); } static bool vc_is_control(struct vc_data *vc, int tc, int c) { /* * A bitmap for codes <32. A bit of 1 indicates that the code * corresponding to that bit number invokes some special action (such * as cursor movement) and should not be displayed as a glyph unless * the disp_ctrl mode is explicitly enabled. */ static const u32 CTRL_ACTION = BIT(ASCII_NULL) | GENMASK(ASCII_SHIFTIN, ASCII_BELL) | BIT(ASCII_CANCEL) | BIT(ASCII_SUBSTITUTE) | BIT(ASCII_ESCAPE); /* Cannot be overridden by disp_ctrl */ static const u32 CTRL_ALWAYS = BIT(ASCII_NULL) | BIT(ASCII_BACKSPACE) | BIT(ASCII_LINEFEED) | BIT(ASCII_SHIFTIN) | BIT(ASCII_SHIFTOUT) | BIT(ASCII_CAR_RET) | BIT(ASCII_FORMFEED) | BIT(ASCII_ESCAPE); if (vc->vc_state != ESnormal) return true; if (!tc) return true; /* * If the original code was a control character we only allow a glyph * to be displayed if the code is not normally used (such as for cursor * movement) or if the disp_ctrl mode has been explicitly enabled. * Certain characters (as given by the CTRL_ALWAYS bitmap) are always * displayed as control characters, as the console would be pretty * useless without them; to display an arbitrary font position use the * direct-to-font zone in UTF-8 mode. */ if (c < BITS_PER_TYPE(CTRL_ALWAYS)) { if (vc->vc_disp_ctrl) return CTRL_ALWAYS & BIT(c); else return vc->vc_utf || (CTRL_ACTION & BIT(c)); } if (c == ASCII_DEL && !vc->vc_disp_ctrl) return true; if (c == ASCII_EXT_CSI) return true; return false; } static void vc_con_rewind(struct vc_data *vc) { if (vc->state.x && !vc->vc_need_wrap) { vc->vc_pos -= 2; vc->state.x--; } vc->vc_need_wrap = 0; } #define UCS_ZWS 0x200b /* Zero Width Space */ #define UCS_VS16 0xfe0f /* Variation Selector 16 */ #define UCS_REPLACEMENT 0xfffd /* Replacement Character */ static int vc_process_ucs(struct vc_data *vc, int *c, int *tc) { u32 prev_c, curr_c = *c; if (ucs_is_double_width(curr_c)) { /* * The Unicode screen memory is allocated only when * required. This is one such case as we need to remember * which displayed characters are double-width. */ vc_uniscr_check(vc); return 2; } if (!ucs_is_zero_width(curr_c)) return 1; /* From here curr_c is known to be zero-width. */ if (ucs_is_double_width(vc_uniscr_getc(vc, -2))) { /* * Let's merge this zero-width code point with the preceding * double-width code point by replacing the existing * zero-width space padding. To do so we rewind one column * and pretend this has a width of 1. * We give the legacy display the same initial space padding. */ vc_con_rewind(vc); *tc = ' '; return 1; } /* From here the preceding character, if any, must be single-width. */ prev_c = vc_uniscr_getc(vc, -1); if (curr_c == UCS_VS16 && prev_c != 0) { /* * VS16 (U+FE0F) is special. It typically turns the preceding * single-width character into a double-width one. Let it * have a width of 1 effectively making the combination with * the preceding character double-width. */ *tc = ' '; return 1; } /* try recomposition */ prev_c = ucs_recompose(prev_c, curr_c); if (prev_c != 0) { vc_con_rewind(vc); *tc = *c = prev_c; return 1; } /* Otherwise zero-width code points are ignored. */ return 0; } static int vc_get_glyph(struct vc_data *vc, int tc) { int glyph = conv_uni_to_pc(vc, tc); u16 charmask = vc->vc_hi_font_mask ? 0x1ff : 0xff; if (!(glyph & ~charmask)) return glyph; if (glyph == -1) return -1; /* nothing to display */ /* Glyph not found */ if ((!vc->vc_utf || vc->vc_disp_ctrl || tc < 128) && !(tc & ~charmask)) { /* * In legacy mode use the glyph we get by a 1:1 mapping. * This would make absolutely no sense with Unicode in mind, but do this for * ASCII characters since a font may lack Unicode mapping info and we don't * want to end up with having question marks only. */ return tc; } /* * The Unicode screen memory is allocated only when required. * This is one such case: we're about to "cheat" with the displayed * character meaning the simple screen buffer won't hold the original * information, whereas the Unicode screen buffer always does. */ vc_uniscr_check(vc); /* Try getting a simpler fallback character. */ tc = ucs_get_fallback(tc); if (tc) return vc_get_glyph(vc, tc); /* Display U+FFFD (Unicode Replacement Character). */ return conv_uni_to_pc(vc, UCS_REPLACEMENT); } static int vc_con_write_normal(struct vc_data *vc, int tc, int c, struct vc_draw_region *draw) { int next_c; unsigned char vc_attr = vc->vc_attr; u16 himask = vc->vc_hi_font_mask; u8 width = 1; bool inverse = false; if (vc->vc_utf && !vc->vc_disp_ctrl) { width = vc_process_ucs(vc, &c, &tc); if (!width) goto out; } /* Now try to find out how to display it */ tc = vc_get_glyph(vc, tc); if (tc == -1) return -1; /* nothing to display */ if (tc < 0) { inverse = true; tc = conv_uni_to_pc(vc, '?'); if (tc < 0) tc = '?'; vc_attr = vc_invert_attr(vc); con_flush(vc, draw); } next_c = c; while (1) { if (vc->vc_need_wrap || vc->vc_decim) con_flush(vc, draw); if (vc->vc_need_wrap) { cr(vc); lf(vc); } if (vc->vc_decim) insert_char(vc, 1); vc_uniscr_putc(vc, next_c); if (himask) tc = ((tc & 0x100) ? himask : 0) | (tc & 0xff); tc |= (vc_attr << 8) & ~himask; scr_writew(tc, (u16 *)vc->vc_pos); if (con_should_update(vc) && draw->x < 0) { draw->x = vc->state.x; draw->from = vc->vc_pos; } if (vc->state.x == vc->vc_cols - 1) { vc->vc_need_wrap = vc->vc_decawm; draw->to = vc->vc_pos + 2; } else { vc->state.x++; draw->to = (vc->vc_pos += 2); } if (!--width) break; /* A space is printed in the second column */ tc = conv_uni_to_pc(vc, ' '); if (tc < 0) tc = ' '; /* * Store a zero-width space in the Unicode screen given that * the previous code point is semantically double width. */ next_c = UCS_ZWS; } out: notify_write(vc, c); if (inverse) con_flush(vc, draw); return 0; } /* acquires console_lock */ static int do_con_write(struct tty_struct *tty, const u8 *buf, int count) { struct vc_draw_region draw = { .x = -1, }; int c, tc, n = 0; unsigned int currcons; struct vc_data *vc = tty->driver_data; struct vt_notifier_param param; bool rescan; if (in_interrupt()) return count; console_lock(); currcons = vc->vc_num; if (!vc_cons_allocated(currcons)) { /* could this happen? */ pr_warn_once("con_write: tty %d not allocated\n", currcons+1); console_unlock(); return 0; } /* undraw cursor first */ if (con_is_fg(vc)) hide_cursor(vc); param.vc = vc; while (!tty->flow.stopped && count) { u8 orig = *buf; buf++; n++; count--; rescan_last_byte: c = orig; rescan = false; tc = vc_translate(vc, &c, &rescan); if (tc == -1) continue; param.c = tc; if (atomic_notifier_call_chain(&vt_notifier_list, VT_PREWRITE, ¶m) == NOTIFY_STOP) continue; if (vc_is_control(vc, tc, c)) { con_flush(vc, &draw); do_con_trol(tty, vc, orig); continue; } if (vc_con_write_normal(vc, tc, c, &draw) < 0) continue; if (rescan) goto rescan_last_byte; } con_flush(vc, &draw); console_conditional_schedule(); notify_update(vc); console_unlock(); return n; } /* * This is the console switching callback. * * Doing console switching in a process context allows * us to do the switches asynchronously (needed when we want * to switch due to a keyboard interrupt). Synchronization * with other console code and prevention of re-entrancy is * ensured with console_lock. */ static void console_callback(struct work_struct *ignored) { console_lock(); if (want_console >= 0) { if (want_console != fg_console && vc_cons_allocated(want_console)) { hide_cursor(vc_cons[fg_console].d); change_console(vc_cons[want_console].d); /* we only changed when the console had already been allocated - a new console is not created in an interrupt routine */ } want_console = -1; } if (do_poke_blanked_console) { /* do not unblank for a LED change */ do_poke_blanked_console = 0; poke_blanked_console(); } if (scrollback_delta) { struct vc_data *vc = vc_cons[fg_console].d; clear_selection(); if (vc->vc_mode == KD_TEXT && vc->vc_sw->con_scrolldelta) vc->vc_sw->con_scrolldelta(vc, scrollback_delta); scrollback_delta = 0; } if (blank_timer_expired) { do_blank_screen(0); blank_timer_expired = 0; } notify_update(vc_cons[fg_console].d); console_unlock(); } int set_console(int nr) { struct vc_data *vc = vc_cons[fg_console].d; if (!vc_cons_allocated(nr) || vt_dont_switch || (vc->vt_mode.mode == VT_AUTO && vc->vc_mode == KD_GRAPHICS)) { /* * Console switch will fail in console_callback() or * change_console() so there is no point scheduling * the callback * * Existing set_console() users don't check the return * value so this shouldn't break anything */ return -EINVAL; } want_console = nr; schedule_console_callback(); return 0; } struct tty_driver *console_driver; #ifdef CONFIG_VT_CONSOLE /** * vt_kmsg_redirect() - sets/gets the kernel message console * @new: the new virtual terminal number or -1 if the console should stay * unchanged * * By default, the kernel messages are always printed on the current virtual * console. However, the user may modify that default with the * %TIOCL_SETKMSGREDIRECT ioctl call. * * This function sets the kernel message console to be @new. It returns the old * virtual console number. The virtual terminal number %0 (both as parameter and * return value) means no redirection (i.e. always printed on the currently * active console). * * The parameter -1 means that only the current console is returned, but the * value is not modified. You may use the macro vt_get_kmsg_redirect() in that * case to make the code more understandable. * * When the kernel is compiled without %CONFIG_VT_CONSOLE, this function ignores * the parameter and always returns %0. */ int vt_kmsg_redirect(int new) { static int kmsg_con; if (new != -1) return xchg(&kmsg_con, new); else return kmsg_con; } /* * Console on virtual terminal * * The console must be locked when we get here. */ static void vt_console_print(struct console *co, const char *b, unsigned count) { struct vc_data *vc = vc_cons[fg_console].d; unsigned char c; static DEFINE_SPINLOCK(printing_lock); const ushort *start; ushort start_x, cnt; int kmsg_console; WARN_CONSOLE_UNLOCKED(); /* this protects against concurrent oops only */ if (!spin_trylock(&printing_lock)) return; kmsg_console = vt_get_kmsg_redirect(); if (kmsg_console && vc_cons_allocated(kmsg_console - 1)) vc = vc_cons[kmsg_console - 1].d; if (!vc_cons_allocated(fg_console)) { /* impossible */ /* printk("vt_console_print: tty %d not allocated ??\n", currcons+1); */ goto quit; } if (vc->vc_mode != KD_TEXT) goto quit; /* undraw cursor first */ if (con_is_fg(vc)) hide_cursor(vc); start = (ushort *)vc->vc_pos; start_x = vc->state.x; cnt = 0; while (count--) { c = *b++; if (c == ASCII_LINEFEED || c == ASCII_CAR_RET || c == ASCII_BACKSPACE || vc->vc_need_wrap) { if (cnt && con_is_visible(vc)) vc->vc_sw->con_putcs(vc, start, cnt, vc->state.y, start_x); cnt = 0; if (c == ASCII_BACKSPACE) { bs(vc); start = (ushort *)vc->vc_pos; start_x = vc->state.x; continue; } if (c != ASCII_CAR_RET) lf(vc); cr(vc); start = (ushort *)vc->vc_pos; start_x = vc->state.x; if (c == ASCII_LINEFEED || c == ASCII_CAR_RET) continue; } vc_uniscr_putc(vc, c); scr_writew((vc->vc_attr << 8) + c, (u |